ML Infrastructure Engineer
Job description
You'll be joining the team that powers the core of their research. This isn't a support role. This is the group that builds the compute backbone behind every major breakthrough. You'll shape how their scientists train models, test ideas, and push their work forward at scale. And because you're joining early, your impact will be felt across the whole organisation.
You'll work on problems that matter. You'll help build fast, reliable GPU systems that let researchers move from idea to result without friction. You'll have room to experiment, try new approaches, and design systems in a place that backs bold thinking.
- Build, run, and improve high-performance GPU training and inference clusters with a focus on reliability and automation
- Design and implement high-throughput data paths, including work on caching, I/O, and data locality across compute and storage
- Benchmark, profile, and fix performance issues across compute, network, and orchestration layers
- Set up clear observability, resilience, and security controls for sensitive research environments
- Work with Research, Data, and Applied teams to plan GPU and storage capacity and support smoother ML experimentation
Requirements
- Strong experience designing and operating large-scale ML compute clusters
- Good understanding of GPU architecture, high-speed networking, and performance tuning for distributed training
- Experience with modern containerised systems and migrations from traditional schedulers
- Knowledge of high-throughput storage systems for ML or HPC workloads
- Solid experience with IaC and CI/CD (Terraform, Argo CD, or similar)
Benefits & conditions
- Salary packages competitive with FAANG businesses
- An opportunity to work on projects that will make a difference in the world; all projects are multi-decade programs aimed at improving society and people's lives
- A rare opportunity to take part in shaping the core ML infra team as it grows from the ground up
- State-of-the-art resources, enabling you to push the boundaries of AI research and development quickly and ethically
- Enhanced holiday pay
- Pension
- Life Assurance
- Income Protection
- Private Medical Insurance
- Hospital Cash Plan
- Therapy Services
- Perk Box
- Electric Car Scheme