Staff Machine Learning Platform Engineer, AI Evaluation

Apple Inc.
Seattle, United States of America
17 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Junior

Job location

Seattle, United States of America

Tech stack

Artificial Intelligence
Continuous Integration
Software Design Documents
Python
Machine Learning
Rapid Prototyping Process
Software Engineering
Graphics Processing Unit (GPU)
System Availability
Large Language Models
FastAPI
Kubernetes
Dask
Docker

Job description

Join Apple Services Engineering to build the next generation of AI evaluation systems. We are seeking a staff machine learning platform engineer to lead the architectural design and development of the high-availability services and internal tools powering self-service evaluation at scale. You will partner with researchers to operationalize their innovations, transforming complex workflows into intuitive, developer-first platforms. We are looking for builders who thrive in the ambiguity of new initiatives and are passionate about creating scalable infrastructure.

We're building the evaluation platform that will serve all of Apple's generative AI and agent systems. This is early-stage work: some scrappy components exist and much is greenfield. We need a staff engineer who can take it from here to org-wide self-service scale.

This is not a "maintain the infra" role. You'll make consequential decisions about what to build, what to integrate, and what to say no to, then ship it in Python with a small team.

Requirements

  • 8+ years of software engineering experience with a track record of owning platform-level technical direction.
  • 0-to-1 builder who designs for scale. You've taken something from nothing to production, made deliberate tradeoffs about what to build now vs. later, and can articulate why.
  • ML depth. You're not building the models, but you can read research code and assess: is this a software problem or an infrastructure problem? Do we need a rewrite or do we need GPUs? You speak the language of research engineers fluently.
  • AI/Agent evaluation experience that goes beyond traces. You understand the hard problems: non-deterministic outputs, multi-step agent reasoning, judge model reliability, scoring drift. You've built or operated systems that handle these.
  • Judgment under ambiguity. You know when to build a rapid prototype for quick validation and when to be disciplined (design doc, review, test). You can tell the difference in real time, not just in retrospect.
  • Communication as a core skill. You write clearly: design docs, decision records, platform roadmaps. You speak clearly in meetings with researchers and in rooms with engineering leaders. You balance the needs and priorities of partner teams and contribute to the sequencing of execution.
  • Python as primary language. Strong with FastAPI, Pydantic, and the ecosystem. Experience with job orchestration frameworks (Temporal.io or similar). Bonus: Go or Rust for compute-hot paths.
  • Operational ownership. You've owned CI/CD, containerization (Docker/K8s), and monitoring for production services. You don't just ship, you keep things running.

Preferred Qualifications

  • Experience with distributed compute frameworks (Ray, Dask)
  • Background in startup or early-stage environments where you wore multiple hats
  • Familiarity with LLM token economics, rate limiting, and cost management at scale
