AI Infrastructure Engineer
Job description
- Build and extend platform capabilities to enable new classes of workloads (e.g., interactive development pods, CI pipelines, inference services, benchmarking jobs).
- Design and operate scalable orchestration systems using Kubernetes across both on-prem and multi-cloud environments.
- Develop platform features such as secret management, configuration management, and deployment automation for customers.
- Partner with development teams to extend the GPU developer platform with features, APIs, templates, and self-service workflows that streamline job orchestration and environment management.
- Manage service lifecycle within Kubernetes using Helm and GitOps workflows (e.g., ArgoCD or Flux).
- Apply expertise in storage and networking to design and integrate CSI drivers, persistent volumes, and network policies that enable high-performance GPU workloads.
AMD may use Artificial Intelligence to help screen, assess or select applicants for this position. AMD's "Responsible AI Policy" is available here.
Requirements
We are seeking a DevOps / Platform Engineer to join our team building and operating large-scale GPU compute infrastructure that powers AI and ML workloads. The ideal candidate is passionate about software engineering and has the leadership skills to independently deliver multi-quarter projects. They communicate effectively and work well with peers across our larger organization. Finally, they are comfortable on a team that operates in startup mode within a larger company and are willing to jump in to help in areas adjacent to their main project as needed.
- 5+ years of experience in DevOps, Platform, or Infrastructure Engineering.
- Deep hands-on experience with Kubernetes and container orchestration at scale.
- Proven ability to design and deliver platform features that serve internal customers or developer teams.
- Experience building developer-facing platforms or internal developer portals (e.g., custom workflow tooling).
Nice to Have
- Hands-on experience in storage or network engineering within Kubernetes environments (e.g., CSI drivers, dynamic provisioning, CNI plugins, or network policy).
- Experience with Infrastructure as Code tools like Terraform.
- Background in HPC, Slurm, or GPU-based compute systems for ML/AI workloads.
- Practical experience with monitoring and observability tools (Prometheus, Grafana, Loki, etc.).
- Understanding of machine learning frameworks (PyTorch, vLLM, SGLang, etc.).