AI Infrastructure Engineer (GPU, Distributed Systems, AI Platforms)

Emporia Consulting Group Limited

Charing Cross, United Kingdom

2 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Compensation

£260K

Job location

Remote

Charing Cross, United Kingdom

Tech stack

Artificial Intelligence

Big Data

C++

Distributed Systems

Fault Tolerance

InfiniBand

Python

Machine Learning

Node.js

Performance Tuning

AI Infrastructure

Graphics Processing Unit (GPU)

High Performance Computing

Software Troubleshooting

AI Platforms

Kubernetes

Bare Metal

Hardware Infrastructure

Job description

Role and responsibilities for the AI Infrastructure Engineer, GPU, Distributed Systems & AI Platforms

Build, operate and optimise large-scale GPU infrastructure for AI training and inference
Support multi-node, multi-GPU environments and distributed workloads
Improve cluster health, fault tolerance and remediation workflows across GPU fleets
Optimise GPU-to-GPU communication, workload performance and infrastructure utilisation
Work with high-performance storage systems supporting large datasets and checkpointing
Build or improve tooling for profiling, monitoring, benchmarking and performance analysis
Collaborate closely with ML researchers, platform teams and infrastructure engineers to remove bottlenecks and improve training efficiency
Support capacity planning and deployment for next-generation compute environments

Requirements

A leading AI business is hiring an AI Infrastructure Engineer with experience in GPUs, distributed systems and AI platforms. Hybrid and remote options are available. The role is outside IR35 and pays between £800 and £1,000 per day.

Experience and skills required for the AI Infrastructure Engineer, GPU, Distributed Systems & AI Platforms

Strong systems-level engineering experience, ideally in infrastructure, HPC, platform engineering or AI/ML environments
Hands-on experience operating large-scale compute or GPU-backed infrastructure
Experience with distributed systems and multi-node environments
Familiarity with NCCL and GPU-to-GPU communication
Experience with Kubernetes, containerised platforms and cluster orchestration
Strong coding ability in Python, Go or C++
Experience working with high-performance storage across complex environments is highly desirable
A strong troubleshooting mindset, with the ability to understand behaviour at the cluster, hardware and network levels

Nice to have

Exposure to InfiniBand, bare-metal provisioning or HPC-style networking
Experience supporting training or inference environments for large-scale ML models
Background in AI infrastructure start-ups, hyperscalers or high-performance compute environments
Experience with profiling and benchmarking tools and with performance optimisation at scale
