Staff Machine Learning Platform Engineer, AI Evaluation

Apple Inc.

Seattle, United States of America

17 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Junior

Job location

Seattle, United States of America

Tech stack

Artificial Intelligence

Continuous Integration

Software Design Documents

Python

Machine Learning

Rapid Prototyping Process

Software Engineering

Graphics Processing Unit (GPU)

System Availability

Large Language Models

FastAPI

Kubernetes

Dask

Docker

Job description

Join Apple Services Engineering to build the next generation of AI evaluation systems. We are seeking a staff machine learning platform engineer to lead the architectural design and development of the high-availability services and internal tools powering self-service evaluation at scale. You will partner with researchers to operationalize their innovations, transforming complex workflows into intuitive, developer-first platforms. We are looking for builders who thrive in the ambiguity of new initiatives and are passionate about creating scalable infrastructure.

We're building the evaluation platform that will serve all of Apple's generative AI and agent systems. This is early-stage work: some scrappy components exist, much is greenfield, and we need a staff engineer who can take it from here to org-wide self-service scale.

This is not a "maintain the infra" role. You'll make consequential decisions about what to build, what to integrate, and what to say no to, then ship it in Python with a small team.

Requirements

8+ years of software engineering experience with a track record of owning platform-level technical direction.
0-to-1 builder who designs for scale. You've taken something from nothing to production, made deliberate tradeoffs about what to build now vs. later, and can articulate why.
ML depth. You're not building the models, but you can read research code and assess: is this a software problem or an infrastructure problem? Do we need a rewrite or do we need GPUs? You speak the language of research engineers fluently.
AI/Agent evaluation experience that goes beyond traces. You understand the hard problems: non-deterministic outputs, multi-step agent reasoning, judge model reliability, scoring drift. You've built or operated systems that handle these.
Judgment under ambiguity. You know when to build a rapid prototype for quick validation and when to be disciplined (design doc, review, test). You can tell the difference in real time, not just in retrospect.
Communication as a core skill. You write clearly: design docs, decision records, platform roadmaps. You speak clearly in meetings with researchers and in rooms with engineering leaders. You balance the needs and priorities of partner teams and contribute to the sequencing of execution.
Python as primary language. Strong with FastAPI, Pydantic, and the ecosystem. Experience with job orchestration frameworks (Temporal.io or similar). Bonus: Go or Rust for compute-hot paths.
Operational ownership. You've owned CI/CD, containerization (Docker/K8s), and monitoring for production services. You don't just ship; you keep things running.

Preferred Qualifications

Experience with distributed compute frameworks (Ray, Dask)
Background in startup or early-stage environments where you wore multiple hats
Familiarity with LLM token economics, rate limiting, and cost management at scale
