Senior Site Reliability Engineer

Hamilton Barnes
Municipality of Madrid, Spain
6 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Municipality of Madrid, Spain

Tech stack

Artificial Intelligence
Build Automation
Bash
Data Centers
Linux
DevOps
Distributed Systems
Python
Performance Tuning
Reliability Engineering
Prometheus
Data Streaming
Cloud Platform System
Grafana
Kubernetes
Information Technology
Hardware Acceleration
Slurm

Job description

Senior Site Reliability Engineer - EU Wide - Remote

Join a stealth-mode hyperscale data centre start-up building an AI and cloud platform, powered by thousands of H100s, H200s, and B200s, ready for experimentation, full-scale model training or inference.

As a Senior Site Reliability Engineer, you'll own the reliability, performance and automation of this GPU-powered infrastructure, ensuring seamless orchestration across environments managed by Slurm, Kubernetes or direct SSH access.

This is a rare opportunity to work at the intersection of hyperscale infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment. If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it.

Interested in finding out more? Apply today.

Responsibilities

* Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
* Build automation pipelines for provisioning, scaling and monitoring compute resources across Slurm and Kubernetes environments.
* Develop observability, alerting and auto-healing systems for high-availability GPU workloads.
* Collaborate with ML, networking and platform teams to optimise resource scheduling, GPU utilisation and data flow.
* Implement infrastructure-as-code, CI/CD pipelines and reliability standards across thousands of nodes.
* Diagnose performance bottlenecks and drive continuous improvements in reliability, latency and throughput.

Requirements

Skills / Must Have

* 7+ years of experience in SRE, DevOps or Infrastructure Engineering roles supporting large-scale compute environments.
* Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
* Deep knowledge of Linux systems, networking and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
* Proficiency in Python, Go or Bash for automation, tooling and performance tuning.
* Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
* Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
* Background in reliability engineering, distributed systems or hardware acceleration environments is a strong plus.

Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Information Technology
Location: The Hague, South Holland, Netherlands
