Big Data Developer

MSR Technology Group LLC
O'Fallon, United States of America
2 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

O'Fallon, United States of America

Tech stack

Apache HTTP Server
Automation of Tests
Big Data
Code Review
Continuous Integration
ETL
Database Queries
Linux
Distributed Data Store
Distributed Systems
Fault Tolerance
Python
Performance Tuning
Standard SQL
Shell Script
SQL Databases
Data Streaming
Ceph
Data Logging
Data Processing
Data Ingestion
Spark
Git
PySpark
Information Technology
Low Latency
Kafka
Apache NiFi
Spark Streaming
Stream Processing
Software Version Control
Data Pipelines

Job description

  • Design, develop, and maintain large-scale Spark applications using Scala and PySpark
  • Build and operate streaming-heavy data pipelines using Kafka and Spark Structured Streaming
  • Implement stateful streaming patterns including windowing, watermarking, late data handling, and checkpointing
  • Develop robust event replay and reprocessing workflows using Kafka offsets and partitions
  • Build ingestion and routing flows using Apache NiFi, including Kafka-based ingestion patterns
  • Implement end-to-end ETL/ELT pipelines with a strong emphasis on low latency, fault tolerance, and scalability
  • Optimize Spark jobs through partitioning strategies, memory tuning, shuffle optimization, and efficient data formats
  • Integrate Spark workloads with distributed object storage systems such as Apache Ozone and Ceph
  • Ensure data quality, consistency, and auditability through validation, reconciliation, and metadata capture
  • Collaborate with platform, infrastructure, and operations teams on production readiness and capacity planning
  • Support production systems, including monitoring, incident analysis, and root cause resolution
  • Contribute to reusable frameworks, coding standards, and engineering best practices
  • Participate in architecture reviews, code reviews, and technical documentation

Requirements

Must Have Technical/Functional Skills

  • Experience with Apache Ozone and/or Ceph as storage backends for analytics workloads
  • Experience implementing exactly-once / at-least-once streaming semantics
  • Strong background in Spark performance tuning (CPU, memory, I/O, shuffle)
  • Experience supporting mission-critical production systems with strict SLAs
  • Familiarity with CI/CD pipelines and automated testing for data applications
  • Experience designing observability for streaming systems (lag, throughput, backpressure)

Technical Skills

  • Languages: Scala, Python (PySpark), SQL
  • Big Data: Apache Spark (Core, SQL, Structured Streaming)
  • Streaming: Kafka
  • Ingestion / Orchestration: Apache NiFi
  • Storage: Apache Ozone, Ceph, object storage concepts
  • OS & Tooling: Linux, Git, CI/CD, monitoring and logging tools

  • Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
  • Strong hands-on experience with Apache Spark in production environments
  • Advanced proficiency in Scala and PySpark
  • Solid understanding of distributed systems and data processing at scale
  • Strong experience with Kafka-based streaming architectures
  • Hands-on experience with Spark Structured Streaming
  • Experience building batch and real-time pipelines
  • Hands-on experience with Apache NiFi for data ingestion and flow management
  • Strong SQL skills and experience working with structured and semi-structured data
  • Experience working with object storage or distributed storage platforms
  • Proficiency with Linux, shell scripting, and Git-based version control