Research Infrastructure Engineer, Training Systems

OpenAI Inc.

San Francisco, United States of America

4 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Compensation

$295K - $380K

Job location

San Francisco, United States of America

Tech stack

API

Systems Engineering

Software Debugging

Distributed Systems

Design of User Interfaces

Python

Machine Learning

Graphics Processing Unit (GPU)

PyTorch

Data Pipelines

Job description

This is a systems engineering role focused on ML training infrastructure. You will work on the systems layer that turns novel research ideas into runnable, measurable training workloads for large models. The work can sit on the critical path for model releases, bringing both the excitement of direct impact and the responsibility of building systems that remain reliable under real pressure.

In This Role, You Will

Build and maintain infrastructure for large-scale model training and experimentation.
Design APIs and interfaces that make complex training workflows easier to express and harder to misuse.
Improve reliability, debuggability, and performance across training and data pipelines.
Debug issues spanning Python, PyTorch, distributed systems, GPUs, networking, and storage.
Write tests, benchmarks, and diagnostics that catch meaningful regressions.

Requirements

You have strong systems instincts and care deeply about performance, reliability, and clean abstractions.
You have good taste in API and interface design, with empathy for the researchers and engineers using your tools.
You are comfortable working across ML research code and production-quality infrastructure.
You enjoy debugging from evidence: profiles, traces, logs, tests, and minimal reproductions.

Benefits & conditions

$295K - $380K
Medical insurance, dental insurance, vision insurance, parental leave, paid time off, paid holidays, 401(k), retirement plan
United States, California, San Francisco
Apr 28, 2026

About The Team

The team works on research and systems that advance frontier models. Our work often goes beyond standard training recipes, which means we also build the infrastructure needed to make new training approaches practical at scale. This is a team where systems work is directly tied to research progress: better tools, abstractions, and runtimes can unlock experiments that would otherwise be too slow, brittle, or difficult to express.

About the company

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.

At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.