HPC AI Cloud Engineer
World Wide Technology
Manchester, United Kingdom
3 days ago
Role details
Contract type: Temporary contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Job location: Manchester, United Kingdom
Tech stack
Build Automation
Bash
Cloud Computing
Profiling
Nvidia CUDA
InfiniBand
Python
Performance Tuning
Remote Direct Memory Access
Ansible
TensorFlow
PyTorch
Kubernetes
Infrastructure Automation Frameworks
Slurm
Terraform
Job description
- Design and execute HPC & AI performance benchmarks (training, inference, scientific workloads)
- Provision and optimize GPU/TPU-based infrastructure on GCP (A3/A4, TPU pods)
- Analyze performance across frameworks (PyTorch, TensorFlow, JAX, CUDA, ROCm)
- Identify system bottlenecks (compute, memory, network, I/O)
- Build automation tools for benchmarking and reporting
- Collaborate with teams to align workloads with optimal architecture
Requirements
- Strong experience with GCP (Compute Engine, GKE, Storage, Networking)
- Hands-on experience with NVIDIA (CUDA/NCCL), AMD (ROCm), and TPUs (XLA/JAX/TF)
- Solid knowledge of HPC concepts (MPI, RDMA, InfiniBand, Slurm/Kubernetes)
- Experience with performance benchmarks (MLPerf, HPL, NCCL, STREAM)
- Proficiency in Python, Bash, and IaC tools (Terraform/Ansible)
- Ability to analyze output from profiling tools (Nsight, TensorBoard, PyTorch Profiler)
Candidates will be required to pass background checks before commencing the contract.
Candidates must be eligible to live and work in the specified work location. Occasional travel may be required. Only successful candidates will be contacted.
About the company
World Wide Technology (WWT) is a global technology integrator and supply chain solutions provider. Through our culture of innovation, we inspire, build, and deliver business results, from idea to outcome.
World Wide Technology UK is looking for a hands-on Cloud Engineer with strong expertise in HPC and AI/ML performance workloads on Google Cloud Platform (GCP). The role focuses on benchmarking, optimizing, and validating performance across advanced accelerator platforms including NVIDIA GPUs, AMD GPUs, and Google TPUs.