Machine Learning Infrastructure Engineer

AllSTEM Connections
Ontario, United States of America
2 days ago

Role details

Contract type
Temporary contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Ontario, United States of America

Tech stack

Information Engineering
Distributed Computing Environment
Distributed Systems
Machine Learning
Build Management
Kubernetes
Machine Learning Operations
Data Pipelines

Job description

ML Infrastructure & Automation

Workflow Architecture: Design and build end-to-end ML workflows and automated pipelines that minimize manual intervention and accelerate the path from experimentation to production.
Training & Serving Platforms: Architect and scale our distributed training and model-serving infrastructure. Build the platforms that handle foundational model training, knowledge distillation, and high-performance inference.
Data Engineering at Scale: Develop robust data sampling and feature generation platforms that provide high-quality input for our ML systems.
Automation & Reliability: Build foundational tools that standardize how we train, track, and deploy models, ensuring high platform reliability and minimal deployment drift.

Performance & Optimization

Cost & Efficiency: Drive architectural decisions that optimize our infrastructure footprint. Implement smart resource management and cost-optimization strategies for large-scale training clusters.
Developer Productivity: Build "developer-first" internal tools that reduce the cognitive load on researchers, allowing them to focus on model logic rather than infrastructure configuration.

Requirements

Experience Baseline: Five (5) to ten (10) years of experience designing, building, and maintaining large-scale ML infrastructure or distributed systems.
Infrastructure Mastery: Deep expertise in container orchestration (e.g., Kubernetes), distributed training, and cloud-native infrastructure.
Pipeline Expertise: Proven track record of managing massive-scale data pipelines and feature stores.
Collaboration: Strong "bridge-builder" personality: you are comfortable working in a high-velocity environment alongside both pure research scientists and core software engineers.

Preferred Attributes

Proven background at large-scale AI/ML-driven technology companies.
Experience with foundational model training infrastructure and techniques (e.g., model distillation, fine-tuning at scale).
A "Scalability Mindset": you prioritize modular, reusable, and testable code over "quick and dirty" scripts.

About the company

For temporary assignments lasting 13 weeks or longer, AllSTEM Connections is pleased to offer major medical, dental, vision, 401(k), and any statutory sick pay where required. We are committed to working with and providing reasonable accommodations to individuals with disabilities. If you need a reasonable accommodation for any part of the employment process, please contact your staffing representative, who will reach out to our HR team. AllSTEM Connections participates in the E-Verify program in certain locations as required by law. We also consider for employment qualified applicants regardless of criminal histories, consistent with legal requirements, including, if applicable, the City of Los Angeles' Fair Chance Initiative for Hiring Ordinance. Pursuant to applicable state and municipal Fair Chance Laws and Ordinances, we will consider for employment qualified applicants with arrest and conviction records, including, if applicable, the San Francisco Fair Chance Ordinance. For Los Angeles, CA applicants: Qualified applications with arrest or conviction records will be considered for employment in accordance with the Los Angeles County Fair Chance Ordinance for Employers and the California Fair Chance Act.
