Software Engineer - AI Infrastructure
Job description
The AI Infrastructure team builds and operates the production systems that power intelligent agents at scale. This team sits at the foundation of the agent platform, ensuring that model inference, orchestration, and execution layers are reliable, observable, and performant under real-world load.
Working closely with the Agent Systems team and broader infrastructure partners, this group owns the core primitives that enable agents to safely operate across internal systems. The environment is high-scale and high-stakes: systems serve millions of users and must meet strict reliability, latency, and correctness standards.
This is a deeply production-oriented team. Engineers here combine strong systems thinking with applied ML infrastructure experience, building in Rust and operating services where performance and failure modes matter.
Responsibilities
- Design and build the infrastructure layer powering AI agent systems in production
- Develop high-performance Rust services that handle model inference, orchestration, and execution
- Architect scalable systems capable of supporting millions of users and high request throughput
- Build reliable ML infrastructure and MLOps patterns for model deployment, evaluation, and monitoring
- Define guardrails, observability, and failure handling for agent-driven workflows
- Optimize latency, throughput, and cost across inference and orchestration layers
- Partner closely with the Agent Systems team to translate experimental prototypes into hardened production systems
- Contribute to foundational infrastructure decisions in a high-scale, high-impact environment
Requirements
- 5+ years of experience building and operating high-scale production systems
- Strong proficiency in Rust and systems-level programming
- Deep understanding of distributed systems, reliability engineering, and performance optimization
- Experience operating services serving millions of users or high-throughput workloads
- Familiarity with ML infrastructure, model serving, or MLOps in production environments
- Experience designing observability, monitoring, and failure recovery systems
- Strong collaboration skills working across infrastructure and applied engineering teams
- High ownership mindset in high-stakes production environments
Nice to have
- Experience building infrastructure for agent-based or LLM-powered systems
- Background in high-performance networking, async systems, or low-latency architectures
- Experience with container orchestration and cloud-native infrastructure
- Familiarity with evaluation frameworks and model performance monitoring at scale
- Experience working in fast-moving 0→1 or platform-building teams