Machine Learning Infrastructure Engineer
Job description
ML Infrastructure & Automation
Workflow Architecture: Design and build end-to-end ML workflows and automated pipelines that minimize manual intervention and accelerate the path from experimentation to production.
Training & Serving Platforms: Architect and scale our distributed training and model-serving infrastructure. Build the platforms that handle foundational model training, knowledge distillation, and high-performance inference.
Data Engineering at Scale: Develop robust data sampling and feature generation platforms that provide high-quality input for our ML systems.
Automation & Reliability: Build foundational tools that standardize how we train, track, and deploy models, ensuring high platform reliability and minimal deployment drift.
Performance & Optimization
Cost & Efficiency: Drive architectural decisions that optimize our infrastructure footprint. Implement smart resource management and cost-optimization strategies for large-scale training clusters.
Developer Productivity: Build "developer-first" internal tools that reduce the cognitive load on researchers, allowing them to focus on model logic rather than infrastructure configuration.
Requirements
Experience Baseline: Five (5) to ten (10) years of experience designing, building, and maintaining large-scale ML infrastructure or distributed systems.
Infrastructure Mastery: Deep expertise in container orchestration (e.g., Kubernetes), distributed training, and cloud-native infrastructure.
Pipeline Expertise: Proven track record of managing massive-scale data pipelines and feature stores.
Collaboration: Strong "bridge-builder" personality: you are comfortable working in a high-velocity environment alongside both pure research scientists and core software engineers.
Preferred Attributes
Proven background at large-scale AI/ML-driven technology companies.
Experience with foundational model training infrastructure and techniques (e.g., model distillation, fine-tuning at scale).
A "Scalability Mindset": you prioritize modular, reusable, and testable code over "quick and dirty" scripts.