AI Infrastructure Engineer (GPU, Distributed Systems, AI Platforms)
Emporia Consulting Group Limited
Charing Cross, United Kingdom
2 days ago
Role details
Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Compensation: £260K
Job location: Remote; Charing Cross, United Kingdom
Tech stack
Artificial Intelligence
Big Data
C++
Distributed Systems
Fault Tolerance
InfiniBand
Python
Machine Learning
Node.js
Performance Tuning
AI Infrastructure
Graphics Processing Unit (GPU)
High Performance Computing
Software Troubleshooting
AI Platforms
Kubernetes
Bare Metal
Hardware Infrastructure
Job description
Role and responsibilities for the AI Infrastructure Engineer, GPU, Distributed Systems & AI Platforms
- Build, operate and optimise large-scale GPU infrastructure for AI training and inference
- Support multi-node, multi-GPU environments and distributed workloads
- Improve cluster health, fault tolerance and remediation workflows across GPU fleets
- Optimise GPU-to-GPU communication, workload performance and infrastructure utilisation
- Work with high-performance storage systems supporting large datasets and checkpointing
- Build or improve tooling for profiling, monitoring, benchmarking and performance analysis
- Collaborate closely with ML researchers, platform teams and infrastructure engineers to remove bottlenecks and improve training efficiency
- Support capacity planning and deployment for next-generation compute environments
Requirements
A leading AI business is hiring an AI Infrastructure Engineer with experience in GPUs, distributed systems and AI platforms. Hybrid/remote options available. Outside IR35. Paying between £800 and £1,000 per day.
Experience and skills required for the AI Infrastructure Engineer, GPU, Distributed Systems & AI Platforms
- Strong systems-level engineering experience, ideally in infrastructure, HPC, platform engineering or AI/ML environments
- Hands-on experience operating large-scale compute or GPU-backed infrastructure
- Experience with distributed systems and multi-node environments
- Familiarity with NCCL and GPU-to-GPU communication
- Experience with Kubernetes, containerised platforms and cluster orchestration
- Strong coding ability in Python, Go or C++
- Experience working with high-performance storage across complex environments is highly desirable
- A strong troubleshooting mindset with the ability to understand behaviour at cluster, hardware and network level
Nice to have
- Exposure to InfiniBand, bare-metal provisioning or HPC-style networking
- Experience supporting training or inference environments for large-scale ML models
- Background in AI infrastructure start-ups, hyperscalers or high-performance compute environments
- Experience with profiling and benchmarking tools, and with performance optimisation at scale