ML Infrastructure & MLOps Engineer
Job description
ML Infrastructure & Container Orchestration
- Distributed Clusters: Architect and maintain high-performance training and serving infrastructure on Google Kubernetes Engine (GKE).
- Model Optimization: Design and implement high-efficiency optimization pipelines, including advanced knowledge distillation and foundational training tooling.
- Platform Scaling: Build, monitor, and optimize shared ML systems to ensure maximum infrastructure uptime, pipeline reliability, and cloud cost-efficiency.
Data Engineering & Pipeline Automation
- Workflow Automation: Build robust, automated pipelines for standardized model training, validation, and continuous deployment (CI/CD for ML).
- Feature Platforms: Develop scalable data sampling and feature-generation platforms to accelerate research experimentation cycles.
- Onboarding & Usability: Drive high platform adoption by building intuitive, standardized deployment tools that shorten onboarding time for research and engineering teams.
Collaboration & Governance
- Cross-Functional Bridge: Collaborate closely with ML researchers and core software engineers to translate theoretical models into highly scalable production systems.
- Methodical Execution: Apply a disciplined, data-backed approach to identify infrastructure bottlenecks, reduce time-to-market, and stabilize complex deployments.
Requirements
Experience:
- 5 to 10+ years of hands-on experience designing and operating large-scale distributed ML platforms.
- Proven track record of supporting production-grade ML workflows in cloud environments.
Technical Mastery:
- Deep expertise in container orchestration, specifically GKE (Google Kubernetes Engine) or equivalent enterprise Kubernetes environments.
- Hands-on experience building scalable ML pipelines (e.g., Kubeflow, Airflow, TFX).
- Strong proficiency in distributed training strategies, feature store management, and model serving infrastructure.
Soft Skills & Attributes:
- Pragmatic Mindset: Strong ownership-driven work style focused on consistency, system reliability, and cost-awareness.
- Effective Communicator: Ability to collaborate seamlessly with highly technical researchers and platform engineers alike.
Preferred Qualifications
- Prior experience working within dedicated, tier-1 enterprise ML/AI platform teams.
- Deep knowledge of distributed systems backend optimization and infrastructure-as-code (IaC).