ML Platform Engineer - GPU Infrastructure
Role details
Job description
Support the team by designing, implementing, and maintaining the automation and ML workload enablement layer of the GPU cluster platform. This role focuses on optimizing GPU compute environments for AI/ML training and Isaac Sim simulation workloads, integrating GPU jobs into CI/CD pipelines, standardizing runtime environments, and supporting reliable storage and artifact management.
Support GPU cluster platforms for AI/ML and simulation workloads
Optimize GPU compute environments for ML training and Isaac Sim execution
Integrate GPU workload execution into CI/CD pipelines
Standardize runtime environments using containers and automation tools
Manage storage, artifacts, and workload outputs
Troubleshoot and improve platform reliability, scalability, and performance
Collaborate with ML, infrastructure, and engineering teams
Requirements
Do you have experience in Tooling?
Do you have a Master's degree?
3+ years of experience in ML Platform Engineering, DevOps, Infrastructure Engineering, or a related field
Bachelor's or Master's degree in Systems Engineering, Computer Science, Computer Engineering, or a related discipline
Experience with Linux, Kubernetes, Docker, and GPU infrastructure
Knowledge of CI/CD tools and automation scripting (Python/Bash)
Experience supporting AI/ML workloads and distributed systems
Familiarity with NVIDIA GPU technologies and containerized environments
Strong troubleshooting and performance optimization skills
Preferred Skills
Experience with Isaac Sim or simulation workloads
Exposure to cloud platforms (AWS, Azure, or GCP)
Knowledge of monitoring and observability tools such as Grafana or Prometheus