Kubernetes Engineer
Ecloud Labs
yesterday
Role details
Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Experience level: Senior
Job location:
Tech stack
Artificial Intelligence
DevOps
Python
Performance Tuning
Role-Based Access Control
Prometheus
Scientific Computing
Systems Integration
Large Language Models
Grafana
Containerization
Kubernetes
Slurm
Terraform
Job description
In this role, you will design, implement, and optimise GPU-accelerated container platforms at scale, enabling high-performance workloads (AI/ML, HPC, LLM training) across hybrid or on-prem environments. You will have deep expertise with both NVIDIA and Kubernetes ecosystems, including GPU scheduling, device plugins and custom operators.
Responsibilities
- Architecting and operating Kubernetes clusters optimised for GPU workloads, leveraging NVIDIA GPU Operator, Network Operator and DCGM
- Developing, deploying and maintaining custom Kubernetes operators and controllers to automate infrastructure services
- Integrating NVIDIA device plugins, Multi-Instance GPU (MIG) and GPU sharing features into the scheduling layer
- Optimising GPU utilisation and job placement through scheduler extensions, such as kube-scheduler plugins, Slurm and Volcano
- Collaborating with HPC, ML and DevOps teams to ensure multi-tenant, high-throughput cluster performance
- Driving observability and telemetry integrations using Prometheus, Grafana, DCGM Exporter and OpenTelemetry
- Implementing secure multi-user and multi-namespace GPU isolation, using RBAC and policy enforcement tools such as OPA or Gatekeeper
- Maintaining CI/CD pipelines for Kubernetes infrastructure using GitOps tools such as ArgoCD and FluxCD
- Contributing to infrastructure-as-code, using Terraform, Helm, and Kustomize
- Participating in performance tuning, incident response and production readiness reviews
Requirements
- Extensive experience running Kubernetes in production-grade environments and working with the NVIDIA stack on Kubernetes, including GPU Operator, device plugin, NVML, MIG and DCGM
- Proficiency in Go or Python for operator development and Kubernetes controller logic
- Deep understanding of Kubernetes internals, including CRDs, RBAC, custom controllers and scheduler extensions
- Experience with GPU-intensive workloads, such as LLMs, training pipelines and scientific computing
- Hands-on experience with Helm, Kustomize and GitOps workflows
- Familiarity with CNI plugins, especially NVIDIA CNI and Multus
- Experience with monitoring GPU metrics and cluster health using Prometheus and DCGM Exporter