Full Stack Software Engineer - ML Compute Capacity

Apple Inc.
Santa Clara, United States of America
7 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Tech stack

API
Cloud Storage
Information Engineering
Software Debugging
Distributed Systems
Elasticsearch
Monitoring of Systems
Python
PostgreSQL
Machine Learning
Prometheus
Web Application Frameworks
React
Grafana
Backend
Kubernetes
Optimization Algorithms
REST
Data Pipelines

Job description

As a senior engineer on the ML Compute Capacity team, you will design, build, and operate the production systems that ensure compute resources are optimally distributed throughout the company. You'll work across the stack - from data pipelines and backend services to APIs and interactive frontends - developing telemetry systems, optimization algorithms, policies, and intuitive tools for managing demand and improving efficiency across Apple's largest accelerator fleet. Our small, nimble team works in a high-autonomy, fast-paced environment, and we're passionate about digging into data patterns, laying out the performance characteristics of an entire distributed system, and knowledge sharing. If the opportunity to own and operate services that scale, stay highly available, and "just work" excites you, then please reach out to us!

Requirements

  • 5+ years of experience in relevant areas
  • Proficiency in Python for production backend and data engineering work
  • Experience building data pipelines and crafting robust queries over large-scale, multi-source data (e.g., Trino, PostgreSQL, Elasticsearch)
  • Experience designing and building RESTful APIs and working with cloud storage technologies
  • Experience with modern web frameworks like React
  • Experience with observability tools (e.g., Prometheus, Grafana) or equivalent monitoring systems
  • Excellent problem-framing and problem-solving skills
  • Strong CS fundamentals
  • Bachelor's degree or higher in Engineering, Mathematics, Economics, or a related quantitative field
  • Experience operating Kubernetes at production scale - including scheduling, resource management, and cluster debugging
  • Familiarity with accelerator utilization patterns across ML training and inference
  • Strong interest in capacity planning, cost attribution, or FinOps systems

About the company

Scaling machine learning workloads across thousands of accelerators creates challenges that few engineers ever encounter. In Apple's Machine Learning Platform Technologies organization, we build the infrastructure that powers large-scale ML training and inference workloads, bringing together expertise in distributed systems, machine learning infrastructure, and high-performance computing.
