ML Platform Engineer - GPU Infrastructure

Optimal Inc.

Warren, United States of America

2 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Intermediate

Job location

Warren, United States of America

Tech stack

Artificial Intelligence

Amazon Web Services (AWS)

Systems Engineering

Azure

Bash

Computer Engineering

Continuous Integration

Linux

DevOps

Distributed Systems

Monitoring of Systems

Python

Machine Learning

Performance Tuning

Prometheus

Scripting (Bash/Python/Go/Ruby)

Grafana

Software Troubleshooting

Containerization

Kubernetes

Information Technology

Hardware Infrastructure

Docker

Job description

Support the team by designing, implementing, and maintaining the automation and ML workload enablement layer of the GPU cluster platform. This role focuses on optimizing GPU compute environments for AI/ML training and Isaac Sim simulation workloads, integrating GPU jobs into CI/CD pipelines, standardizing runtime environments, and supporting reliable storage and artifact management.

Support GPU cluster platforms for AI/ML and simulation workloads

Optimize GPU compute environments for ML training and Isaac Sim execution

Integrate GPU workload execution into CI/CD pipelines

Standardize runtime environments using containers and automation tools

Manage storage, artifacts, and workload outputs

Troubleshoot and improve platform reliability, scalability, and performance

Collaborate with ML, infrastructure, and engineering teams

Requirements

Do you have experience in Tooling?

Do you have a Master's degree?

3+ years of experience in ML Platform Engineering, DevOps, Infrastructure Engineering, or a related field

Bachelor's or Master's degree in Systems Engineering, Computer Science, Computer Engineering, or a related discipline

Experience with Linux, Kubernetes, Docker, and GPU infrastructure

Knowledge of CI/CD tools and automation scripting (Python/Bash)

Experience supporting AI/ML workloads and distributed systems

Familiarity with NVIDIA GPU technologies and containerized environments

Strong troubleshooting and performance optimization skills

Preferred Skills

Experience with Isaac Sim or simulation workloads

Exposure to cloud platforms (AWS, Azure, or GCP)

Knowledge of monitoring and observability tools such as Grafana or Prometheus
