Big Data Developer

MSR Technology Group LLC

O'Fallon, United States of America

2 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

O'Fallon, United States of America

Tech stack

Apache HTTP Server

Automation of Tests

Big Data

Code Review

Continuous Integration

ETL

Database Queries

Linux

Distributed Data Store

Distributed Systems

Fault Tolerance

Python

Performance Tuning

Standard SQL

Shell Script

SQL Databases

Data Streaming

Ceph

Data Logging

Data Processing

Data Ingestion

Spark

Git

PySpark

Information Technology

Low Latency

Kafka

Apache NiFi

Spark Streaming

Stream Processing

Software Version Control

Data Pipelines

Job description

Design, develop, and maintain large-scale Spark applications using Scala and PySpark
Build and operate streaming-heavy data pipelines using Kafka and Spark Structured Streaming
Implement stateful streaming patterns including windowing, watermarking, late data handling, and checkpointing
Develop robust event replay and reprocessing workflows using Kafka offsets and partitions
Build ingestion and routing flows using Apache NiFi, including Kafka-based ingestion patterns
Implement end-to-end ETL/ELT pipelines with strong emphasis on low latency, fault tolerance, and scalability
Optimize Spark jobs through partitioning strategies, memory tuning, shuffle optimization, and efficient data formats
Integrate Spark workloads with distributed object storage systems such as Apache Ozone and Ceph
Ensure data quality, consistency, and auditability through validation, reconciliation, and metadata capture
Collaborate with platform, infrastructure, and operations teams on production readiness and capacity planning
Support production systems, including monitoring, incident analysis, and root cause resolution
Contribute to reusable frameworks, coding standards, and engineering best practices
Participate in architecture reviews, code reviews, and technical documentation

Requirements

Must Have Technical/Functional Skills

Experience with Apache Ozone and/or Ceph as storage backends for analytics workloads
Experience implementing exactly-once / at-least-once streaming semantics
Strong background in Spark performance tuning (CPU, memory, I/O, shuffle)
Experience supporting mission-critical production systems with strict SLAs
Familiarity with CI/CD pipelines and automated testing for data applications
Experience designing observability for streaming systems (lag, throughput, backpressure)

Technical Skills

Languages: Scala, Python (PySpark), SQL
Big Data: Apache Spark (Core, SQL, Structured Streaming)
Streaming: Kafka
Ingestion / Orchestration: Apache NiFi
Storage: Apache Ozone, Ceph, object storage concepts
OS & Tooling: Linux, Git, CI/CD, monitoring and logging tools

Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
Strong hands-on experience with Apache Spark in production environments
Advanced proficiency in Scala and PySpark
Solid understanding of distributed systems and data processing at scale
Strong experience with Kafka-based streaming architectures
Hands-on experience with Spark Structured Streaming
Experience building batch and real-time pipelines
Hands-on experience with Apache NiFi for data ingestion and flow management
Strong SQL skills and experience working with structured and semi-structured data
Experience working with object storage or distributed storage platforms
Proficiency with Linux, shell scripting, and Git-based version control
