Technical Lead Manager (ML Platform Infrastructure)

Nuro Inc.
Mountain View, United States of America
10 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior
Compensation
$235K

Job location

Remote
Mountain View, United States of America

Tech stack

Amazon Web Services (AWS)
Systems Engineering
Azure
Big Data
Cloud Computing
Cloud Engineering
Computer Clusters
ETL
Device Drivers
Distributed Computing Environment
Distributed Systems
Machine Learning
Redis
Ceph
Spark
Caching
Backend
Kubernetes
Information Technology
Slurm
Machine Learning Operations
Apache Beam
NVMe

Job description

  • Nuro is seeking an experienced Technical Lead Manager with deep expertise in large-scale infrastructure, workload orchestration, and batch and streaming data processing systems to join our ML Infrastructure team
  • In this role, you will lead the evolution of our core platform, ensuring our researchers and engineers have seamless access to the compute and data resources required to build the future of autonomous driving
  • You will drive the strategy for automated resource provisioning, high-performance workload scheduling, and efficient feature management
  • As a TLM, you will balance hands-on technical leadership with people management, mentoring a high-performing team while partnering closely with ML Research and Autonomy teams to eliminate infrastructure bottlenecks and accelerate the Nuro Driver development lifecycle
  • Setting Technical Strategy: Defining the roadmap for a unified ML platform that abstracts complex cloud infrastructure
  • Resource Provisioning & IaC: Scaling our automated infrastructure-as-code (IaC) pipelines to manage thousands of GPU/CPU nodes across diverse environments
  • Intelligent Scheduling: Designing and optimizing workload orchestration to maximize hardware utilization, minimize job wait times, and handle massive-scale distributed training
  • Data Dumping & ETL: Designing robust pipelines for the extraction and transformation of petabyte-scale sensor and telemetry data into ML-ready formats
  • Feature Caching & Feature Stores: Implementing robust feature caching and storage solutions to reduce redundant computations and ensure low-latency access to pre-computed features
  • Team Leadership: Mentoring and growing a team of software and systems engineers, fostering a culture of operational excellence and technical innovation

Requirements

  • Resource Provisioning: Deep familiarity with modern Infrastructure-as-Code and provisioning tools (e.g., Terraform, Pulumi, or Crossplane)
  • Feature Management: Experience implementing or maintaining feature stores and caching layers (e.g., Feast, Hopsworks, or Redis-based custom caching)
  • Experience: 6+ years of professional experience in ML Infrastructure, Backend Platform Engineering, or Distributed Systems, with 3+ years of people/team management experience
  • Workload Scheduling: Hands-on experience building or managing large-scale orchestrators for compute-heavy workloads (e.g., Kubernetes/KubeRay, Ray, Slurm, or Volcano)
  • Data Dumping (ETL): Proven expertise in large-scale data extraction and transformation. You must be proficient in at least one distributed processing framework, such as Apache Spark or Apache Beam
  • Experience with high-performance storage systems (e.g., Lustre, Ceph, or specialized NVMe caching) for ML data loading
  • Knowledge of cost-optimization strategies for large-scale GPU clusters in public clouds (AWS/GCP/Azure)
  • Active contributor to open-source projects in the MLOps or Cloud-Native ecosystem (e.g., CNCF, Ray, or Kubeflow communities)
  • Advanced degree (Ph.D. or M.Sc.) in Computer Science, Systems Engineering, or a related technical field

Benefits & conditions

  • Free Caltrain pass and commuter benefits
  • Company stock options
  • Work from home opportunities
  • Health insurance
