Staff Software Engineer, AI Infrastructure

Harrison Clarke
Sunnyvale, United States of America
yesterday

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English

Job location

Sunnyvale, United States of America

Tech stack

API
Artificial Intelligence
Automated Storage and Retrieval Systems
Cloud Computing
Software Debugging
Distributed Systems
Memory Management
Python
Software Deployment
AI Infrastructure
Datadog
Large Language Models
Multi-Agent Systems
Backend
Event Driven Architecture
Build Management
Containerization
Kubernetes
Low Latency
Machine Learning Operations
TensorRT

Job description

We're seeking an exceptional LLM Infrastructure Engineer to help design and scale the systems that power production-grade AI agents. This is a rare opportunity to work at the intersection of distributed systems, model orchestration, reasoning frameworks, and developer platforms as AI moves from simple chat interfaces to fully autonomous software.

You'll join a team building the runtime layer that enables agents to plan, reason, execute actions, use tools, manage memory, and operate reliably at scale.

What You'll Work On

  • Design and build highly scalable agent execution runtimes capable of handling millions of model invocations and tool calls
  • Develop orchestration systems for multi-agent workflows, planning, task decomposition, and long-running autonomous processes
  • Build infrastructure for memory management, context retrieval, state persistence, and agent observability
  • Create reliable execution frameworks for tool use, function calling, code execution, and external integrations
  • Optimize latency, throughput, reliability, and cost across large-scale LLM deployments
  • Develop evaluation, monitoring, tracing, and debugging systems for agent performance
  • Collaborate closely with research and applied AI teams to productionize cutting-edge agent architectures
  • Help define the infrastructure layer powering the next generation of AI-native applications

Requirements

  • Strong software engineering fundamentals with expertise in distributed systems and backend infrastructure
  • Experience building large-scale systems in Python, Go, Rust, or similar languages
  • Deep understanding of modern LLM architectures and inference workflows
  • Experience working with agent frameworks, orchestration systems, or AI infrastructure platforms
  • Familiarity with vector databases, retrieval systems, memory architectures, and context management
  • Strong knowledge of cloud infrastructure, Kubernetes, containerization, and production deployment practices
  • Experience designing APIs, SDKs, and developer-facing platforms
  • Ability to operate across infrastructure, platform, and AI application layers

Strongly Preferred

  • Experience with LangGraph, OpenAI Agents SDK, AutoGen, CrewAI, Temporal, Prefect, or similar orchestration frameworks
  • Experience building agent evaluation pipelines and observability tooling
  • Familiarity with model serving frameworks such as vLLM, TensorRT-LLM, TGI, or Ray Serve
  • Knowledge of distributed workflow engines and event-driven architectures
  • Experience scaling AI products from prototype to production

Benefits & conditions

  • Join a high-caliber team of engineers and researchers from leading AI labs and infrastructure companies
  • Significant technical ownership and influence over core platform architecture
  • Competitive compensation package including meaningful equity
  • Backed by top-tier investors, with substantial runway

If you're excited about building the infrastructure layer that enables AI agents to become reliable, scalable, and production-ready, we'd love to speak with you.
