Kubernetes Engineer

Ecloud Labs
yesterday

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Tech stack

Artificial Intelligence
DevOps
Python
Performance Tuning
Role-Based Access Control
Prometheus
Scientific Computing
Systems Integration
Large Language Models
Grafana
Containerization
Kubernetes
Slurm
Terraform

Job description

In this role, you will design, implement, and optimise GPU-accelerated container platforms at scale, enabling high-performance workloads (AI/ML, HPC, LLM training) across hybrid or on-prem environments. You will bring deep expertise in both the NVIDIA and Kubernetes ecosystems, including GPU scheduling, device plugins and custom operators.

Responsibilities

  • Architecting and operating Kubernetes clusters optimised for GPU workloads, leveraging NVIDIA GPU Operator, Network Operator and DCGM
  • Developing, deploying and maintaining custom Kubernetes operators and controllers to automate infrastructure services
  • Integrating NVIDIA device plugins, Multi-Instance GPU (MIG) and GPU sharing features into the scheduling layer
  • Optimising GPU utilisation and job placement through scheduler extensions and batch schedulers, such as kube-scheduler plugins, Volcano and Slurm
  • Collaborating with HPC, ML and DevOps teams to ensure multi-tenant, high-throughput cluster performance
  • Driving observability and telemetry integrations using Prometheus, Grafana, DCGM Exporter and OpenTelemetry
  • Implementing secure multi-user and multi-namespace GPU isolation using RBAC and policy enforcement tools such as OPA or Gatekeeper
  • Maintaining CI/CD pipelines for Kubernetes infrastructure using GitOps tooling such as ArgoCD and FluxCD
  • Contributing to infrastructure-as-code, using Terraform, Helm, and Kustomize
  • Participating in performance tuning, incident response and production readiness reviews

Requirements

  • Extensive experience with Kubernetes in production-grade environments and with the NVIDIA stack on Kubernetes, including GPU Operator, device plugin, NVML, MIG and DCGM
  • Proficiency in Go or Python for operator development and Kubernetes controller logic
  • Deep understanding of Kubernetes internals, including CRDs, RBAC, custom controllers and scheduler extensions
  • Experience with GPU-intensive workloads, such as LLMs, training pipelines and scientific computing
  • Hands-on experience with Helm, Kustomize and GitOps workflows
  • Familiarity with CNI plugins, especially NVIDIA CNI and Multus
  • Experience with monitoring GPU metrics and cluster health using Prometheus and DCGM Exporter
