Platform Engineer (AI/LLM Infrastructure)
Job description
- Lead the design, implementation, and operation of scalable infrastructure platforms supporting AI/LLM-based solutions for enterprise clients
- Act as a hands-on technical lead (player-coach), contributing to development while guiding a team of engineers
- Own end-to-end infrastructure architecture below the application layer, including compute, container orchestration, CI/CD, observability, and security
- Partner directly with clients and stakeholders to design, present, and deliver robust AI infrastructure solutions
- Architect and manage production-grade Kubernetes environments (AKS/EKS), including cluster operations and RBAC
- Design and operationalize RAG pipelines, including ingestion, chunking, embedding workflows, and vector database management
- Lead GPU infrastructure provisioning and optimization (NVIDIA A100/H100 or similar)
- Drive Infrastructure-as-Code adoption using Terraform and GitOps practices (ArgoCD/Flux)
- Build and maintain CI/CD pipelines using GitHub Actions and Azure DevOps
- Establish observability standards using Datadog, OpenTelemetry, and ELK/OpenSearch
- Lead incident response, on-call processes, and post-mortem analysis
- Ensure strong security posture and lead InfoSec review processes
- Coordinate delivery across multiple teams and client engagements
Requirements
- 5-8 years of experience in Platform Engineering, SRE, or Infrastructure Engineering
- 3+ years of proven experience delivering and leading infrastructure for AI/LLM-based production systems
- Strong hands-on expertise in Kubernetes, Docker, and Helm
- 3+ years of experience with Terraform and GitOps (ArgoCD/Flux)
- 3+ years of experience with Azure (Key Vault, Monitor, DevOps Pipelines)
- 3+ years of experience leading client-facing technical engagements
- 3+ years of experience managing multiple concurrent projects or teams
- 3+ years of hands-on experience with incident management and SLA-driven environments
- 3+ years of experience leading security/InfoSec reviews
- Strong understanding of vector databases, RAG pipelines, and LLM inference systems
- 3+ years of experience with CI/CD and container registry management
- Degree: Bachelor's degree in Computer Science or Engineering, or equivalent practical experience
Nice to Have (But Not Required):
- Experience with AWS in addition to Azure
- Familiarity with Azure API Management and AKS
- Experience with Pulumi (Python/TypeScript)
- Knowledge of NIM deployment and lifecycle management
- Python scripting for infrastructure automation
- Experience with load testing tools (k6, Locust, JMeter)
- Exposure to FinOps and cost optimization practices