Staff Software Engineer, AI Infrastructure

Harrison Clarke

Sunnyvale, United States of America

yesterday

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Job location

Sunnyvale, United States of America

Tech stack

API

Artificial Intelligence

Automated Storage and Retrieval Systems

Cloud Computing

Software Debugging

Distributed Systems

Memory Management

Python

Software Deployment

AI Infrastructure

Datadog

Large Language Models

Multi-Agent Systems

Backend

Event Driven Architecture

Build Management

Containerization

Kubernetes

Low Latency

Machine Learning Operations

TensorRT

Job description

We're seeking an exceptional LLM Infrastructure Engineer to help design and scale the systems that power production-grade AI agents. This is a rare opportunity to work at the intersection of distributed systems, model orchestration, reasoning frameworks, and developer platforms as AI moves from simple chat interfaces to fully autonomous software.

You'll join a team building the runtime layer that enables agents to plan, reason, execute actions, use tools, manage memory, and operate reliably at scale.

What You'll Work On

Design and build highly scalable agent execution runtimes capable of handling millions of model invocations and tool calls
Develop orchestration systems for multi-agent workflows, planning, task decomposition, and long-running autonomous processes
Build infrastructure for memory management, context retrieval, state persistence, and agent observability
Create reliable execution frameworks for tool use, function calling, code execution, and external integrations
Optimize latency, throughput, reliability, and cost across large-scale LLM deployments
Develop evaluation, monitoring, tracing, and debugging systems for agent performance
Collaborate closely with research and applied AI teams to productionize cutting-edge agent architectures
Help define the infrastructure layer powering the next generation of AI-native applications

Requirements

Strong software engineering fundamentals with expertise in distributed systems and backend infrastructure
Experience building large-scale systems in Python, Go, Rust, or similar languages
Deep understanding of modern LLM architectures and inference workflows
Experience working with agent frameworks, orchestration systems, or AI infrastructure platforms
Familiarity with vector databases, retrieval systems, memory architectures, and context management
Strong knowledge of cloud infrastructure, Kubernetes, containerization, and production deployment practices
Experience designing APIs, SDKs, and developer-facing platforms
Ability to operate across infrastructure, platform, and AI application layers

Strongly Preferred

Experience with LangGraph, OpenAI Agents SDK, AutoGen, CrewAI, Temporal, Prefect, or similar orchestration frameworks
Experience building agent evaluation pipelines and observability tooling
Familiarity with model serving frameworks such as vLLM, TensorRT-LLM, TGI, or Ray Serve
Knowledge of distributed workflow engines and event-driven architectures
Experience scaling AI products from prototype to production

Benefits & conditions

Join a high-caliber team of engineers and researchers from leading AI labs and infrastructure companies
Significant technical ownership and influence over core platform architecture
Competitive compensation package including meaningful equity
Backed by top-tier investors with substantial runway

If you're excited about building the infrastructure layer that enables AI agents to become reliable, scalable, and production-ready, we'd love to speak with you.
