ML Infrastructure Engineer

JetBrains
17 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English

Tech stack

Airflow
Amazon Web Services (AWS)
Azure
Linux
Distributed Systems
InfiniBand
Role-Based Access Control
Ansible
Prometheus
Management of Software Versions
Reinforcement Learning
Data Logging
Grafana
Kubernetes
Slurm
Machine Learning Operations
Terraform

Job description

We're looking for an ML Infrastructure Engineer to manage our research infrastructure and boost its efficiency and scalability. This role involves maintaining GPU clusters and standalone machines, building robust monitoring, and removing bottlenecks that slow down experimentation. You will work alongside ML researchers and engineers to solve practical problems across a wide range of projects involving distributed training, reinforcement learning, training agentic backbones, and more.

In this role, you will:

  • Design, operate, and continuously improve our Kubernetes GPU cluster, including NVIDIA drivers, MIG, and high-speed networking (InfiniBand/NVLink).
  • Manage and tune our job orchestrator (Ray) so researchers can launch distributed training and benchmarking jobs with minimal friction.
  • Implement robust monitoring, logging, and alerting with Prometheus, Thanos, Loki, and Grafana to track resource utilization and optimize costs.
  • Identify and resolve infrastructure bottlenecks to maximize GPU utilization.
  • Collaborate closely with our SRE, IT, and Security teams to ensure our research environment integrates smoothly with company standards.
  • Educate and support researchers on infrastructure-related topics and troubleshoot ad-hoc requests.
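
To illustrate the kind of monitoring work described above, here is a minimal Prometheus alerting rule for spotting idle GPU allocations — a sketch only, assuming GPU metrics are scraped via NVIDIA's DCGM exporter; the exact metric and label names depend on the exporter version and scrape configuration:

```yaml
groups:
  - name: gpu-utilization
    rules:
      - alert: GPUUnderutilized
        # Fires when a GPU averages below 10% utilization over 30 minutes,
        # a common signal of an idle allocation wasting cluster capacity.
        expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 10
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} is underutilized"
```

In practice a rule like this would feed Alertmanager and a Grafana dashboard so researchers can see which jobs are holding GPUs without using them.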

Requirements

  • Proven hands-on experience administering Kubernetes clusters (control-plane operations, RBAC, CNI, storage, upgrades).

  • Solid Linux fundamentals, including networking, containers, and troubleshooting.
  • Experience diagnosing performance issues in distributed systems and setting up monitoring for them.
  • Strong communication skills to work collaboratively with engineers and researchers of different backgrounds.

We would be especially thrilled if you:

  • Have worked with GPU clusters.
  • Have MLOps experience, including training pipelines, experiment tracking, and data versioning.
  • Are familiar with infrastructure-as-code (Terraform, Ansible, or similar).
  • Have worked with Ray, Slurm, Flyte, Airflow, or other workload orchestrators.
  • Understand cloud platforms (AWS, GCP, or Azure) and hybrid/on-premises networking setups.

About the company

At JetBrains, code is our passion. Since 2000, we have strived to make the most effective developer tools on earth. By automating routine checks and corrections, our tools speed up production, freeing developers to grow, discover, and create. The ML Research team at JetBrains applies machine learning - in particular reinforcement learning, agentic approaches, and federated learning - to help developers and enhance the software development process. At the heart of our work is a fast, flexible, and reliable infrastructure designed to run and scale experiments efficiently.
