Cloud MLOps Engineer

Insight Global
Austin, United States of America
3 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Compensation
$185K

Job location

Austin, United States of America

Tech stack

Artificial Intelligence
Amazon Web Services (AWS)
Azure
Bash
Cloud Computing
Cloud Engineering
Continuous Integration
Python
Machine Learning
Robotic Automation Software
Software Deployment
Data Streaming
Management of Software Versions
Scripting (Bash/Python/Go/Ruby)
Kubernetes
Kafka
Slurm
Machine Learning Operations

Job description

We are seeking a Cloud MLOps Engineer to build and operate the cloud infrastructure that powers machine learning for a humanoid robotics platform. This role sits at the intersection of ML research, production systems, and end-user applications, with a strong focus on robot telemetry data, model lifecycle management, and production deployment. You will enable researchers and applied ML engineers to reliably train, evaluate, and deploy models at scale, while ensuring telemetry-driven insights flow from robots in the real world back into continuous learning systems.

What You'll Do

Design, deploy, and maintain cloud-native MLOps platforms supporting large-scale ML training, evaluation, and inference workloads
Operate Kubernetes-based infrastructure (self-managed or managed services such as GKE, EKS, or AKS) for ML workloads and data applications
Build and maintain end-to-end ML pipelines that bridge research workflows with production systems
Support robot telemetry ingestion, processing, and analytics, enabling model feedback loops from deployed humanoid robots
Integrate and operate ML tooling such as MLflow, Weights & Biases, Slurm, or similar systems for experiment tracking, scheduling, and reproducibility
Enable model deployment to production, including CI/CD for models, versioning, monitoring, and rollback strategies
Partner closely with ML researchers, perception, controls, and applications teams to productionize models safely and efficiently
Implement observability across ML systems, including model performance, data drift, and system health
Improve reliability, scalability, and security of cloud ML infrastructure supporting real-world robotic systems

Requirements

Strong experience with cloud platforms: AWS, GCP, and/or Azure
Hands-on experience operating Kubernetes or managed Kubernetes services in production
Experience building or maintaining MLOps platforms supporting training and inference
Familiarity with ML experiment tracking and orchestration tools (e.g., MLflow, Weights & Biases, Slurm, Ray, Kubeflow, or similar)
Experience deploying ML models into production-facing applications or services
Strong understanding of CI/CD, infrastructure-as-code, and automation
Proficiency in Python; experience with Bash or another scripting language
Ability to collaborate effectively across research and engineering teams

Nice to Have Skills & Experience

Experience working with robotics or real-time telemetry data
Familiarity with streaming data systems (e.g., Kafka, Pub/Sub, Kinesis)
Experience supporting GPU workloads in cloud or Kubernetes environments
Exposure to edge-cloud ML deployment or fleet-based systems
Prior work in robotics, autonomy, or embodied AI environments

Benefits & conditions

Benefit packages for this role start on the first day of employment and include medical, dental, and vision insurance; HSA, FSA, and DCFSA account options; and access to a 401(k) retirement account with employer matching. Employees in this role are also entitled to paid sick leave and/or other paid time off as provided by applicable law.
