HPC AI Cloud Engineer

World Wide Technology
Manchester, United Kingdom
3 days ago

Role details

Contract type
Temporary contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English

Job location

Manchester, United Kingdom

Tech stack

Build Automation
Bash
Cloud Computing
Profiling
Nvidia CUDA
InfiniBand
Python
Performance Tuning
Remote Direct Memory Access
Ansible
TensorFlow
PyTorch
Kubernetes
Infrastructure Automation Frameworks
Slurm
Terraform

Job description

  • Design and execute HPC & AI performance benchmarks (training, inference, scientific workloads)
  • Provision and optimize GPU/TPU-based infrastructure on GCP (A3/A4, TPU pods)
  • Analyze performance across frameworks (PyTorch, TensorFlow, JAX, CUDA, ROCm)
  • Identify system bottlenecks (compute, memory, network, I/O)
  • Build automation tools for benchmarking and reporting
  • Collaborate with teams to align workloads with optimal architecture

Requirements

  • Strong experience with GCP (Compute Engine, GKE, Storage, Networking)
  • Hands-on experience with NVIDIA (CUDA/NCCL), AMD (ROCm), and TPUs (XLA/JAX/TF)
  • Solid knowledge of HPC concepts (MPI, RDMA, InfiniBand, Slurm/Kubernetes)
  • Experience with performance benchmarks (MLPerf, HPL, NCCL tests, STREAM)
  • Proficiency in Python, Bash, and IaC tools (Terraform/Ansible)
  • Ability to analyze output from profiling tools (Nsight, TensorBoard, PyTorch Profiler)

Candidates will be required to pass background checks before commencing the contract.

Candidates must be eligible to live and work in the specified work location. Occasional travel may be required. Only successful candidates will be contacted.

About the company

World Wide Technology (WWT) is a global technology integrator and supply chain solutions provider. Through our culture of innovation, we inspire, build, and deliver business results, from idea to outcome.

World Wide Technology UK is looking for a hands-on Cloud Engineer with strong expertise in HPC and AI/ML performance workloads on Google Cloud Platform (GCP). The role focuses on benchmarking, optimizing, and validating performance across advanced accelerator platforms including NVIDIA GPUs, AMD GPUs, and Google TPUs.
