Kubernetes Platform Engineer

Xpath Solutions LLC
Charlotte, United States of America
28 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Charlotte, United States of America

Tech stack

Kubernetes Security
Artificial Intelligence
Amazon Web Services (AWS)
Audit Trail
Azure
Backup Devices
Bash
Cloud Computing
Computer Networks
Continuous Integration
Linux
DevOps
Disaster Recovery
Python
Machine Learning
Network Control
Open Source Technology
Role-Based Access Control
Prometheus
Security Information and Event Management
Management of Software Versions
AI Infrastructure
Policy as Code
Google Cloud Platform
Istio
Large Language Models
Grafana
Multi-Cloud
Generative AI
AI Platforms
Kubernetes
Deployment Automation
Linkerd (Service Mesh)
Machine Learning Operations
Terraform
Webhooks
Docker
Programming Languages
Microservices

Job description

We are looking for a Senior Kubernetes Platform Engineer to design, build, and operate mission-critical Kubernetes infrastructure that powers large-scale Machine Learning (ML) and Generative AI (GenAI) workloads. This is not a standard Kubernetes admin role - you will act as a subject matter expert, driving architecture decisions across scheduling, networking, security, storage, and multi-tenancy. You will work closely with ML engineers, researchers, and application teams to build scalable, GPU-optimized platforms that accelerate AI innovation.

Kubernetes Platform Engineering

  • Design, deploy, and manage multi-cluster Kubernetes environments (EKS, GKE, AKS)
  • Build advanced Kubernetes components including CRDs, Operators, admission webhooks, and custom schedulers
  • Optimize Kubernetes for GPU workloads (NVIDIA device plugins, MIG, time-slicing)
  • Implement autoscaling solutions (HPA, VPA, KEDA, Cluster Autoscaler)
  • Enforce security using RBAC, OPA/Gatekeeper, and Pod Security Standards
  • Manage service mesh (Istio / Linkerd) for secure and observable microservices
  • Configure networking (Cilium, Calico), ingress controllers, and network policies
  • Lead cluster lifecycle management (upgrades, backups, disaster recovery)
  • Package platform components using Helm and Kustomize

ML / GenAI Infrastructure

  • Design ML pipelines using Kubeflow, Argo Workflows, or Ray
  • Build scalable model serving platforms (KServe, Triton, TorchServe, vLLM)
  • Optimize distributed compute using Ray on Kubernetes
  • Design storage solutions for ML datasets and artifacts (EFS, GCS, NFS, etc.)
  • Enable GPU-backed environments (JupyterHub, Kubeflow Notebooks)
  • Deploy and manage vector databases for RAG applications
  • Optimize LLM inference (batching, caching, multi-GPU scaling)

Infrastructure as Code (Terraform)

  • Develop and maintain reusable Terraform modules for cloud infrastructure
  • Implement remote state management and multi-environment workflows
  • Enforce best practices: versioning, drift detection, policy-as-code
  • Integrate Terraform into CI/CD pipelines and GitOps workflows
  • Use tools like Atlantis or Terraform Cloud for automated deployments

Observability, Security & Reliability

  • Build observability stack (Prometheus, Grafana, Loki, Jaeger/Tempo)
  • Implement audit logging and runtime security (Falco, SIEM integration)
  • Define SLOs/SLIs and maintain platform reliability
  • Perform GPU capacity planning and cost optimization
  • Lead incident response and post-mortem analysis

  • Kubernetes (Expert level)
  • Terraform (Advanced)
  • Helm / Kustomize
  • AWS / Google Cloud Platform / Azure (EKS, GKE, AKS)
  • Istio / Linkerd
  • Argo Workflows / Kubeflow / Ray
  • KServe / Triton
  • Prometheus / Grafana
  • Cilium / Calico
  • OPA / Gatekeeper
  • NVIDIA GPU Operator
  • Docker / containerd
  • GitOps tools (ArgoCD / Flux)
  • Python / Go / Bash
  • Linux systems and networking

Requirements

  • 7+ years in cloud/platform engineering
  • 5+ years hands-on Kubernetes in production
  • Deep understanding of Kubernetes internals (control plane, CNI, CSI, etc.)
  • Experience running GPU-based ML/AI workloads at scale
  • Strong Terraform expertise (modules, CI/CD, multi-cloud)
  • Experience with ML orchestration tools (Kubeflow, Argo, or Ray)
  • Proficiency in at least one programming language (Python, Go, or Bash)
  • Experience with GitOps and secure container practices

Preferred Qualifications

  • CKA (Certified Kubernetes Administrator) - Required
  • CKS (Certified Kubernetes Security Specialist) - Preferred
  • CKAD certification
  • Cloud DevOps certifications (AWS / Google Cloud Platform)
  • Terraform certification
  • Experience with Crossplane or multi-cluster management
  • Familiarity with eBPF tools (Hubble, Pixie)
  • Contributions to CNCF or open-source Kubernetes ecosystem

  • A systems thinker with a strong platform mindset
  • Proactive and automation-driven
  • Comfortable working cross-functionally with ML and engineering teams
  • Influential communicator who can drive architecture decisions
  • Security-focused and reliability-driven

Why Join Us

This role is ideal for engineers passionate about Kubernetes and AI infrastructure who want to build the backbone of next-generation enterprise AI platforms.
