Founding Engineer - Site Reliability

Urun LLC
San Francisco, United States of America
9 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior
Compensation
$285K

Job location

San Francisco, United States of America

Tech stack

Amazon Web Services (AWS)
Cloud Computing
Software Debugging
Reliability Engineering
Prometheus
Runbook
Datadog
Data Logging
Grafana
Kubernetes
Machine Learning Operations
Virtual Private Clouds

Job description

Reliability at uRun isn't a feature - it's the product. When model labs and production teams build on top of our inference platform, they are trusting us with their uptime, their latency, and their users. As our Site Reliability Engineer, you will own that trust end-to-end.

This is a founding SRE hire. You will define the reliability culture from scratch: the observability stack, the incident response playbooks, the SLOs, and the on-call process. You will work directly with infrastructure and platform engineers to close the gap between what we ship and what stays up.

What you'll actually be doing day-to-day

  • Define and own SLOs and error budgets across uRun's inference platform and supporting infrastructure
  • Build and maintain the observability stack end-to-end: metrics, logging, tracing, and alerting across a distributed GPU compute environment
  • Lead incident response: detection, triage, resolution, and blameless postmortems that drive lasting fixes
  • Partner with ML infrastructure engineers to embed reliability into the deployment pipeline from day one
  • Design and maintain runbooks, on-call rotations, and escalation paths as the team scales
  • Drive capacity planning and traffic management across heterogeneous compute to protect latency and availability under load
  • Identify and eliminate toil through automation, building systems that scale without scaling the team proportionally

Requirements

  • 7+ years in site reliability, production engineering, or infrastructure engineering in a high-availability, low-latency environment
  • Deep experience owning SLOs, error budgets, and on-call processes in production at scale
  • Strong observability background: you have built or owned monitoring stacks (Prometheus, Grafana, Datadog, or equivalent) and know what good alerting looks like
  • Proven incident response experience: you have led real incidents under pressure and written postmortems that actually changed behaviour
  • Hands-on with Kubernetes and cloud infrastructure (AWS preferred): you can debug a failing pod and a misconfigured VPC in the same afternoon
  • Strong software engineering fundamentals: you write automation, not just runbooks
  • Comfortable operating as the first and only SRE, setting standards without a template to follow

Things that will give you an edge

  • Experience supporting GPU compute or ML inference infrastructure in production
  • Familiarity with stateful workloads, long-running sessions, or streaming inference systems
  • Exposure to multi-tenant platforms where isolation, noisy neighbour problems, and billing-aware scheduling matter
  • Prior founding or sole SRE experience at an early-stage company

Benefits & conditions

  • Competitive salary and meaningful equity in an early-stage AI infrastructure company. The band above is our target; for an exceptional candidate we'll go higher. Equity is real - you're early, and the grant reflects that.
  • Health, dental, and vision - full coverage
  • 401(k) - company-supported retirement savings
  • FSA/HSA - flexible spending and health savings accounts for healthcare costs
  • Paid time off - we trust you to manage your time
  • Top-tier tooling - access to the best AI tools available: Claude, Codex, Kimi, and whatever else helps you move faster
  • MacBook Pro and AirPods - the hardware you need, on us

How we work (and what that feels like day-to-day)

We build the stage, not the show. We're an infrastructure company, a developer-tools company, and a production partner for model labs. That focus is a deliberate choice, and we hold to it.

Day-to-day, that means a small team, a high bar, and real ownership. You won't wait for permission or inherit a backlog of someone else's decisions. In a founding SRE role, the function is what you make it.

About the company

uRun is the inference cloud for interactive AI: the compute layer that makes real-time, stateful inference possible at scale. We came out of stealth in April 2026, are backed by top-tier investors, and are founded by Keegan McCallum, who scaled inference infrastructure for some of the most demanding generative AI workloads in production.
