Senior ML/RL Training Infrastructure Engineer

Apple
Zürich, Switzerland
3 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Zürich, Switzerland

Tech stack

Systems Engineering
Software Quality
Computer Engineering
Continuous Integration
Core Foundation
Software Debugging
Distributed Systems
Fault Tolerance
Python
Machine Learning
Performance Tuning
Software Engineering
Reinforcement Learning
Graphics Processing Unit (GPU)
PyTorch
Information Technology

Job description

Ready to transform how billions of people interact with technology? Apple's Core Foundation Models team is driving the intelligence that powers experiences across billions of devices worldwide, and we're looking for exceptional talent to join us! Join our Europe-based applied ML team building the next generation of large-scale ML and RL training infrastructure for Apple's foundation models. We develop high-performance, distributed systems that power cutting-edge foundation model research at massive scale. We are seeking an engineer who is passionate about designing, optimizing, and scaling the infrastructure that enables state-of-the-art machine learning and reinforcement learning workloads.

As a senior member of the team, you will work closely with researchers and systems engineers to build robust training frameworks, accelerate experimentation, and push the boundaries of performance and efficiency. You will collaborate with teams across Apple's engineering hubs, including New York, Seattle, and Cupertino, to advance the tooling and systems that make large-scale model training possible. If you thrive at the intersection of distributed systems, ML frameworks, and high-performance computing, this is the role for you.

As a core member of our ML infrastructure team, you will design, build, and scale the systems that enable large-scale reinforcement learning for Apple's foundation models. You will focus on TPU-based training with JAX, developing robust, high-performance RL pipelines that support distributed actor/learner architectures, efficient experience replay, and large-scale environment execution.

In this role, you will work across the full stack of RL training systems, from low-level performance tuning and compiler optimization to cluster-level orchestration and resource management. You will ensure that training pipelines are efficient, reliable, reproducible, and observable, enabling research teams to iterate quickly and explore more complex RL environments and models.

Your work will directly impact the scalability, throughput, and stability of RL experiments, helping to unlock new capabilities in agentic reasoning, decision-making, and policy learning for Apple's foundation models. This position is ideal for engineers who enjoy distributed systems, high-performance ML frameworks, and building the infrastructure that makes large-scale RL research possible.

Requirements

Practical experience developing or optimizing training loops, RL pipelines, or large-scale model-training frameworks.
Strong software engineering skills in Python, with emphasis on reliability, debuggability, and high-performance execution.
Deep experience with PyTorch/JAX internals, XLA, and debugging and performance profiling on GPU/TPU architectures.
Expertise in distributed RL training patterns, including actor/learner architectures, experience replay, and parallel environment execution.
Experience building training services, orchestration tools, or automated pipelines for large-scale experiments.
Proven success diagnosing bottlenecks in large-scale ML jobs (I/O, input pipelines, kernel performance, memory, compilation).
Familiarity with RL-specific infrastructure requirements (e.g., actor/learner architectures, experience replay systems, large-scale environment execution).
Strong software engineering practices: code quality, design reviews, testing, observability, CI/CD.
Experience working with cloud-scale clusters or specialized accelerators (TPU v5/v6, GPU, custom hardware).
Contributions to ML frameworks, distributed training libraries, or high-performance computing systems.
Excellent communication and collaboration skills for working with research and engineering partners.

Minimum Qualifications

PhD or MSc in Computer Science, Computer Engineering, or a closely related field.
Hands-on experience designing, building, or maintaining large-scale ML training infrastructure.
Strong proficiency with PyTorch or JAX and experience running training workloads on GPUs/TPUs.
Solid understanding of distributed systems concepts (parallelism strategies, fault tolerance, synchronization).

About the company

At Apple, we're not all the same. And that's our greatest strength. We draw on the differences in who we are, what we've experienced, and how we think. Because to create products that serve everyone, we believe in including everyone. Therefore, we are committed to treating all applicants fairly and equally. We will work with applicants to make any reasonable accommodations.

Apply for this position