AI Infrastructure Engineer
Job description
- Design, build, and maintain a scalable platform for serving LLM workloads in production
- Deploy and manage containerised workloads on Kubernetes, including GPU-based infrastructure
- Implement and optimise model serving solutions (e.g. vLLM, Triton, TGI)
- Set up monitoring and observability using tools such as Prometheus and Grafana
- Build and improve CI/CD pipelines and automate infrastructure using Python and Infrastructure as Code
Requirements
You are a platform or DevOps engineer with strong experience running complex systems in production, ideally with exposure to AI/ML infrastructure and large-scale environments. You understand how to operate workloads reliably at scale and are comfortable working with modern tooling across cloud, Kubernetes, and automation. Your focus is infrastructure and platform engineering rather than data science, with a strong emphasis on reliability, performance, and operational excellence.
- Strong experience with Kubernetes in production and solid Linux systems knowledge
- Hands-on experience with GPU infrastructure (e.g. NVIDIA A100/H100) and LLM/ML model serving
- Experience with CI/CD tools (Azure DevOps, GitLab CI, Jenkins) and Python scripting
- Familiarity with monitoring tools (Prometheus, Grafana) and infrastructure automation (Terraform, Ansible)
- Experience in regulated environments or cost optimisation for high-performance workloads is a plus
- Demonstrated hands-on experience with large language models and GPU-based inference at scale is essential