ML Platform Engineer - GPU Infrastructure

Optimal Inc.
Warren, United States of America
2 days ago

Role details

Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Experience level: Intermediate

Job location

Warren, United States of America

Tech stack

Artificial Intelligence
Amazon Web Services (AWS)
Systems Engineering
Azure
Bash
Computer Engineering
Continuous Integration
Linux
DevOps
Distributed Systems
Monitoring of Systems
Python
Machine Learning
Performance Tuning
Prometheus
Scripting (Bash/Python/Go/Ruby)
Grafana
Software Troubleshooting
Containerization
Kubernetes
Information Technology
Hardware Infrastructure
Docker

Job description

Support the team by designing, implementing, and maintaining the automation and ML workload enablement layer of the GPU cluster platform. This role focuses on optimizing GPU compute environments for AI/ML training and Isaac Sim simulation workloads, integrating GPU jobs into CI/CD pipelines, standardizing runtime environments, and supporting reliable storage and artifact management.

Support GPU cluster platforms for AI/ML and simulation workloads
Optimize GPU compute environments for ML training and Isaac Sim execution
Integrate GPU workload execution into CI/CD pipelines
Standardize runtime environments using containers and automation tools
Manage storage, artifacts, and workload outputs
Troubleshoot and improve platform reliability, scalability, and performance
Collaborate with ML, infrastructure, and engineering teams

Requirements

Do you have experience in Tooling?
Do you have a Master's degree?

3+ years of experience in ML Platform Engineering, DevOps, Infrastructure Engineering, or a related field
Bachelor's or Master's degree in Systems Engineering, Computer Science, Computer Engineering, or a related discipline
Experience with Linux, Kubernetes, Docker, and GPU infrastructure
Knowledge of CI/CD tools and automation scripting (Python/Bash)
Experience supporting AI/ML workloads and distributed systems
Familiarity with NVIDIA GPU technologies and containerized environments
Strong troubleshooting and performance optimization skills

Preferred Skills

Experience with Isaac Sim or simulation workloads
Exposure to cloud platforms (AWS, Azure, or GCP)
Knowledge of monitoring and observability tools such as Grafana or Prometheus
