Senior Site Reliability Engineer, GPU & ML Infrastructure (M/F)
Job description
At Criteo, the Platform Core group builds the foundational infrastructure powering our global advertising platform. We design and operate large-scale, resilient systems supporting real-time decision-making and data processing across thousands of services.
As we expand our distributed computing and ML infrastructure capabilities, we are building a new team focused on GPU platforms and high-performance model serving technologies.
As a Site Reliability Engineer in the GPU team, you will help design, operate, and scale the infrastructure powering machine learning training and inference workloads.
You will work on technologies such as:
Ray on Kubernetes
- Build and operate scalable Ray clusters running on Kubernetes.
- Develop reliable self-service distributed computing platforms for ML workloads.
- Improve the provisioning, observability, reliability, and operational efficiency of Ray-as-a-Service environments.
NVIDIA Triton Inference Server
- Operate and optimize large-scale inference platforms using Triton.
- Improve latency, throughput, scalability, and GPU utilization for deep learning inference workloads.
You will collaborate closely with ML engineers, data scientists, and infrastructure teams to deliver reliable, production-grade ML platforms accelerating innovation across Criteo.
Requirements
- 5+ years of experience in backend engineering, Site Reliability Engineering, or platform engineering roles focused on distributed systems.
- Strong experience with Kubernetes, including workload scheduling, dynamic provisioning, and custom controllers/operators.
- Hands-on experience running or optimizing GPU-based workloads in production, ideally for ML training or inference systems.
- Strong software engineering skills in C#, Python, Go, or similar languages, with a focus on building reliable distributed systems.
- Experience building or operating production-grade infrastructure with strong requirements around performance, scalability, and reliability.
- Strong interest in automation, observability, and designing systems that scale efficiently under high load.
Bonus Points
- Experience with distributed ML frameworks such as Ray or similar systems.
- Familiarity with inference serving stacks such as NVIDIA Triton or TensorRT.
- Experience with GPU scheduling, resource management, or multi-tenant GPU platforms.
- Exposure to cloud-native GPU orchestration (GKE, EKS, or on-prem Kubernetes GPU clusters).