Research Infrastructure Engineer, Training Systems

OpenAI Inc.
San Francisco, United States of America
4 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Compensation
$ 295K

Job location

San Francisco, United States of America

Tech stack

API
Systems Engineering
Software Debugging
Distributed Systems
Design of User Interfaces
Python
Machine Learning
Graphics Processing Unit (GPU)
PyTorch
Data Pipelines

Job description

This is a systems engineering role focused on ML training infrastructure. You will work on the systems layer that turns novel research ideas into runnable, measurable training workloads for large models. The work can sit on the critical path for model releases, bringing both the excitement of direct impact and the responsibility of building systems that remain reliable under real pressure.

In This Role, You Will

  • Build and maintain infrastructure for large-scale model training and experimentation.
  • Design APIs and interfaces that make complex training workflows easier to express and harder to misuse.
  • Improve reliability, debuggability, and performance across training and data pipelines.
  • Debug issues spanning Python, PyTorch, distributed systems, GPUs, networking, and storage.
  • Write tests, benchmarks, and diagnostics that catch meaningful regressions.

Requirements

  • You have strong systems instincts and care deeply about performance, reliability, and clean abstractions.
  • You have good taste in API and interface design, with empathy for the researchers and engineers using your tools.
  • You are comfortable working across ML research code and production-quality infrastructure.
  • You enjoy debugging from evidence: profiles, traces, logs, tests, and minimal reproductions.

Benefits & conditions

$295K - $380K medical insurance, dental insurance, vision insurance, parental leave, paid time off, paid holidays, 401(k), retirement plan United States, California, San Francisco Apr 28, 2026

About The Team

The team works on research and systems that advance frontier models. Our work often goes beyond standard training recipes, which means we also build the infrastructure needed to make new training approaches practical at scale. This is a team where systems work is directly tied to research progress: better tools, abstractions, and runtimes can unlock experiments that would otherwise be too slow, brittle, or difficult to express.

About the company

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity., At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.

Apply for this position