Full Stack Software Engineer - ML Compute Capacity
Job description
As a senior engineer on the ML Compute Capacity team, you will design, build, and operate the production systems that ensure compute resources are optimally distributed throughout the company. You'll work across the stack - from data pipelines and backend services to APIs and interactive frontends - developing telemetry systems, optimization algorithms, policies, and intuitive tools for managing demand and improving efficiency across Apple's largest accelerator fleet. Our small, nimble team works in a high-autonomy, fast-paced environment, and we're passionate about digging into data patterns, mapping out the performance characteristics of an entire distributed system, and sharing knowledge. If the opportunity to own and operate services that scale, stay highly available, and "just work" excites you, then please reach out to us!
Requirements
- 5+ years of experience in relevant areas
- Proficiency in Python for production backend and data engineering work
- Experience building data pipelines and crafting robust queries over large-scale, multi-source data (e.g., Trino, PostgreSQL, Elasticsearch)
- Experience designing and building RESTful APIs and working with cloud storage technologies
- Experience with modern web frameworks like React
- Experience with observability tools (e.g., Prometheus, Grafana) or equivalent monitoring systems
- Excellent problem-framing and problem-solving skills
- Strong CS fundamentals
- Bachelor's degree or higher in Engineering, Mathematics, Economics, or a related quantitative field
- Experience operating Kubernetes at production scale - including scheduling, resource management, and cluster debugging
- Familiarity with accelerator utilization patterns across ML training and inference
- Strong interest in capacity planning, cost attribution, or FinOps systems