ML Infrastructure Engineer
Job description
We're looking for an ML Infrastructure Engineer to manage our research infrastructure and improve its efficiency and scalability. This role involves maintaining GPU clusters and standalone machines, building robust monitoring, and removing bottlenecks that slow down experimentation. You will work alongside ML researchers and engineers to solve practical problems across a wide range of projects involving distributed training, reinforcement learning, training agentic backbones, and more.
In this role, you will:
- Design, operate, and continuously improve our Kubernetes GPU cluster, including NVIDIA drivers, MIG, and high-speed networking (InfiniBand/NVLink).
- Manage and tune our job orchestrator (Ray) so researchers can launch distributed training and benchmarking jobs with minimal friction.
- Implement robust monitoring, logging, and alerting with Prometheus, Thanos, Loki, and Grafana to track resource utilization and optimize costs.
- Identify and resolve infrastructure bottlenecks to maximize GPU utilization.
- Collaborate closely with our SRE, IT, and Security teams to ensure our research environment integrates smoothly with company standards.
- Educate and support researchers on infrastructure-related topics and troubleshoot ad hoc requests.
Requirements
- Proven hands-on experience administering Kubernetes clusters (control-plane operations, RBAC, CNI, storage, upgrades).
- Solid Linux fundamentals, including networking, containers, and troubleshooting.
- Experience diagnosing performance issues in distributed systems and setting up monitoring for them.
- Strong communication skills to work collaboratively with engineers and researchers of different backgrounds.
We would be especially thrilled if you:
- Have worked with GPU clusters.
- Have MLOps experience, including training pipelines, experiment tracking, and data versioning.
- Are familiar with infrastructure-as-code (Terraform, Ansible, or similar).
- Have worked with Ray, Slurm, Flyte, Airflow, or other workload orchestrators.
- Understand cloud platforms (AWS, GCP, or Azure) and hybrid/on-premises networking setups.