Machine Learning Infrastructure Engineer

Cubiq Recruitment
Oxford, United Kingdom
2 days ago

Role details

Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Experience level: Senior

Job location

Oxford, United Kingdom

Tech stack

Artificial Intelligence
Profiling
Continuous Integration
Distributed File Systems
Graphics Processing Unit (GPU)
Terraform

Job description

The institute is seeking an experienced ML Infrastructure Engineer to join its growing compute and platform engineering team. You'll play a pivotal role in developing and operating the high-performance cloud and compute backbone that powers large-scale machine learning and scientific discovery. This is a hands-on, high-impact role where you'll design and optimise GPU infrastructure, improve performance across compute and storage layers, and ensure scalability, resilience, and security across AI research environments.

What You'll Do

  • Design, deploy, and operate high-performance GPU compute clusters for large-scale ML training and inference.
  • Engineer reliable, high-throughput data paths, optimising I/O performance, caching, and storage locality.
  • Benchmark and troubleshoot compute, network, and orchestration bottlenecks to maximise performance.
  • Implement observability, automation, and security practices that support compliant, resilient environments.
  • Collaborate with research and data teams to forecast capacity, manage resources, and streamline ML experimentation pipelines.
  • Support the transition from traditional HPC systems to modern, containerised, and cloud-native infrastructure.

Requirements

  • Proven experience designing, building, and maintaining large-scale ML or HPC compute infrastructure.
  • Deep understanding of GPU architecture, distributed training, and high-speed networking.
  • Expertise with high-throughput or parallel storage systems for ML/HPC workloads.
  • Solid grasp of IaC and CI/CD tooling (e.g. Terraform, Argo CD).
  • Proactive, self-directed approach with strong systems design and problem-solving skills.

Nice to Have

  • Familiarity with Lustre or similar distributed file systems.
  • Experience with performance benchmarking, profiling, and cost optimisation.
  • Background in scientific or research computing environments.

Benefits & conditions

  • Help build the infrastructure driving breakthrough AI and scientific research.
  • Work in a collaborative, forward-thinking environment that values innovation and inclusion.
  • Access modern facilities and advanced technology in a rapidly growing institute.
  • Competitive salary and comprehensive benefits, including:
      • Enhanced holiday pay
      • Pension, life assurance, and income protection
      • Private medical insurance and therapy services
      • Electric car scheme and wellbeing perks

About the company

This world-leading AI research firm is reimagining how science and innovation translate into real-world impact. Through interdisciplinary collaboration and cutting-edge facilities, the institute develops end-to-end solutions that tackle humanity's most complex challenges, from sustainable agriculture and healthcare to climate change and artificial intelligence. It provides state-of-the-art laboratories and collaborative environments designed to accelerate discovery and application at scale.
