Machine Learning Infrastructure Engineer

Cubiq Recruitment

Oxford, United Kingdom

4 months ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Oxford, United Kingdom

Tech stack

Artificial Intelligence

Profiling

Continuous Integration

Distributed File Systems

Graphics Processing Unit (GPU)

Terraform

Job description

The institute is seeking an experienced ML Infrastructure Engineer to join its growing compute and platform engineering team. You'll play a pivotal role in developing and operating the high-performance cloud and compute backbone that powers large-scale machine learning and scientific discovery. This is a hands-on, high-impact role where you'll design and optimise GPU infrastructure, improve performance across compute and storage layers, and ensure scalability, resilience, and security across AI research environments.

What You'll Do

Design, deploy, and operate high-performance GPU compute clusters for large-scale ML training and inference.
Engineer reliable, high-throughput data paths, optimising I/O performance, caching, and storage locality.
Benchmark and troubleshoot compute, network, and orchestration bottlenecks to maximise performance.
Implement observability, automation, and security practices that support compliant, resilient environments.
Collaborate with research and data teams to forecast capacity, manage resources, and streamline ML experimentation pipelines.
Support the transition from traditional HPC systems to modern, containerised, and cloud-native infrastructure.

Requirements

Proven experience designing, building, and maintaining large-scale ML or HPC compute infrastructure.
Deep understanding of GPU architecture, distributed training, and high-speed networking.
Expertise with high-throughput or parallel storage systems for ML/HPC workloads.
Solid grasp of infrastructure-as-code (IaC) and CI/CD tooling (e.g. Terraform, Argo CD).
Proactive, self-directed approach with strong systems design and problem-solving skills.

Nice to Have

Familiarity with Lustre or similar distributed file systems.
Experience with performance benchmarking, profiling, and cost optimisation.
Background in scientific or research computing environments.

Benefits & conditions

Help build the infrastructure driving breakthrough AI and scientific research.
Work in a collaborative, forward-thinking environment that values innovation and inclusion.
Access modern facilities and advanced technology in a rapidly growing institute.
Competitive salary and comprehensive benefits, including:

Enhanced holiday pay
Pension, life assurance, and income protection
Private medical insurance and therapy services
Electric car scheme and wellbeing perks

About the company

This world-leading AI research firm is reimagining how science and innovation translate into real-world impact. Through interdisciplinary collaboration and cutting-edge facilities, the institute develops end-to-end solutions that tackle humanity's most complex challenges, from sustainable agriculture and healthcare to climate change and artificial intelligence. They provide state-of-the-art laboratories and collaborative environments designed to accelerate discovery and application at scale.
