Kubernetes Platform Engineer
Job description
We are looking for a Senior Kubernetes Platform Engineer to design, build, and operate mission-critical Kubernetes infrastructure that powers large-scale Machine Learning (ML) and Generative AI (GenAI) workloads. This is not a standard Kubernetes admin role - you will act as a subject matter expert, driving architecture decisions across scheduling, networking, security, storage, and multi-tenancy. You will work closely with ML engineers, researchers, and application teams to build scalable, GPU-optimized platforms that accelerate AI innovation.
Kubernetes Platform Engineering
- Design, deploy, and manage multi-cluster Kubernetes environments (EKS, GKE, AKS)
- Build advanced Kubernetes components including CRDs, Operators, admission webhooks, and custom schedulers
- Optimize Kubernetes for GPU workloads (NVIDIA device plugins, MIG, time-slicing)
- Implement autoscaling solutions (HPA, VPA, KEDA, Cluster Autoscaler)
- Enforce security using RBAC, OPA/Gatekeeper, and Pod Security Standards
- Manage service mesh (Istio / Linkerd) for secure and observable microservices
- Configure networking (Cilium, Calico), ingress controllers, and network policies
- Lead cluster lifecycle management (upgrades, backups, disaster recovery)
- Package platform components using Helm and Kustomize
ML / GenAI Infrastructure
- Design ML pipelines using Kubeflow, Argo Workflows, or Ray
- Build scalable model serving platforms (KServe, Triton, TorchServe, vLLM)
- Optimize distributed compute using Ray on Kubernetes
- Design storage solutions for ML datasets and artifacts (EFS, GCS, NFS, etc.)
- Enable GPU-backed environments (JupyterHub, Kubeflow Notebooks)
- Deploy and manage vector databases for RAG applications
- Optimize LLM inference (batching, caching, multi-GPU scaling)
Infrastructure as Code (Terraform)
- Develop and maintain reusable Terraform modules for cloud infrastructure
- Implement remote state management and multi-environment workflows
- Enforce best practices: versioning, drift detection, policy-as-code
- Integrate Terraform into CI/CD pipelines and GitOps workflows
- Use tools like Atlantis or Terraform Cloud for automated deployments
Observability, Security & Reliability
- Build observability stack (Prometheus, Grafana, Loki, Jaeger/Tempo)
- Implement audit logging and runtime security (Falco, SIEM integration)
- Define SLOs/SLIs and maintain platform reliability
- Perform GPU capacity planning and cost optimization
- Lead incident response and post-mortem analysis
Tech stack
- Kubernetes (Expert level)
- Terraform (Advanced)
- Helm / Kustomize
- AWS / Google Cloud Platform / Azure (EKS, GKE, AKS)
- Istio / Linkerd
- Argo Workflows / Kubeflow / Ray
- KServe / Triton
- Prometheus / Grafana
- Cilium / Calico
- OPA / Gatekeeper
- NVIDIA GPU Operator
- Docker / containerd
- GitOps tools (ArgoCD / Flux)
- Python / Go / Bash
- Linux systems and networking
Requirements
- 7+ years in cloud/platform engineering
- 5+ years hands-on Kubernetes in production
- Deep understanding of Kubernetes internals (control plane, CNI, CSI, etc.)
- Experience running GPU-based ML/AI workloads at scale
- Strong Terraform expertise (modules, CI/CD, multi-cloud)
- Experience with ML orchestration tools (Kubeflow, Argo, or Ray)
- Proficiency in at least one programming language (Python, Go, or Bash)
- Experience with GitOps and secure container practices
- CKA (Certified Kubernetes Administrator) certification
Preferred Qualifications
- CKS (Certified Kubernetes Security Specialist)
- CKAD certification
- Cloud DevOps certifications (AWS / Google Cloud Platform)
- Terraform certification
- Experience with Crossplane or multi-cluster management
- Familiarity with eBPF tools (Hubble, Pixie)
- Contributions to CNCF or open-source Kubernetes ecosystem
Who You Are
- A systems thinker with a strong platform mindset
- Proactive and automation-driven
- Comfortable working cross-functionally with ML and engineering teams
- Influential communicator who can drive architecture decisions
- Security-focused and reliability-driven
Why Join Us
This role is ideal for engineers passionate about Kubernetes and AI infrastructure who want to build the backbone of next-generation enterprise AI platforms.