ML Infrastructure Engineer
Job description
We're looking for an ML Infrastructure Engineer to manage our research infrastructure and improve its efficiency and scalability. This role involves maintaining GPU clusters and standalone machines, building robust monitoring, and removing bottlenecks that slow down experimentation. You will work alongside ML researchers and engineers to solve practical problems across a wide range of projects involving distributed training, reinforcement learning, training agentic backbones, and more.
In this role, you will:
- Design, operate, and continuously improve our Kubernetes GPU cluster, including NVIDIA drivers, MIG, and high-speed networking (InfiniBand/NVLink).
- Manage and tune our job orchestrator (Ray) so researchers can launch distributed training and benchmarking jobs with minimal friction.
- Implement robust monitoring, logging, and alerting with Prometheus, Thanos, Loki, and Grafana to track resource utilization and optimize costs.
- Identify and resolve infrastructure bottlenecks to maximize GPU utilization.
- Collaborate closely with our SRE, IT, and Security teams to ensure our research environment integrates smoothly with company standards.
- Educate and support researchers on infrastructure-related topics and troubleshoot ad hoc requests.
Requirements
- Proven hands-on experience administering Kubernetes clusters (control-plane operations, RBAC, CNI, storage, upgrades).
- Solid Linux fundamentals, including networking, containers, and troubleshooting.
- Experience diagnosing performance issues in distributed systems and setting up monitoring for them.
- Strong communication skills to work collaboratively with engineers and researchers of different backgrounds.
We would be especially thrilled if you:
- Have worked with GPU clusters.
- Have MLOps experience, including training pipelines, experiment tracking, and data versioning.
- Are familiar with infrastructure-as-code (Terraform, Ansible, or similar).
- Have worked with Ray, Slurm, Flyte, Airflow, or other workload orchestrators.
- Understand cloud platforms (AWS, GCP, or Azure) and hybrid/on-premises networking setups.