Senior/Staff Backend Engineer - Distributed Systems
Job description
Our hiring partner is looking for a Backend Engineer to build the systems that orchestrate GPU clusters for AI workloads. You'll develop APIs that manage GPU allocation, memory, compute scheduling, and multi-tenant isolation: challenges unique to AI infrastructure that go far beyond standard backend engineering. On their backend team, you'll tackle questions like: How can high-cost GPU resources be shared efficiently among users? How do we handle memory constraints for large AI models? How do we maintain quality of service when workloads compete for compute? This is an opportunity to build infrastructure where a single API call can allocate thousands of dollars of compute per hour, and where your optimizations directly influence whether AI startups can train their models affordably.
What you'll do
- Design APIs that simplify complex GPU operations for developers
- Build scheduling algorithms that maximize GPU utilization while meeting SLAs
- Develop systems to manage the full GPU lifecycle: provisioning, allocation, scheduling, and release
- Implement usage tracking and billing for GPU-hours, memory, and compute utilization
- Create monitoring solutions for GPU-specific metrics, health checks, and automated recovery
- Build multi-tenant systems with resource isolation, quota management, and fair scheduling
- Optimize cold starts for model serving and efficient model loading
- Collaborate with frontend engineers to expose complex infrastructure through intuitive interfaces
- Leverage AI-assisted coding tools (GitHub Copilot, Claude Code, Cursor IDE, etc.) to enhance productivity and code quality
Requirements
- Have 5+ years of backend engineering experience in distributed systems
- Are proficient in Go, Python, or similar backend languages
- Have experience with resource scheduling, orchestration, and API design (REST, GraphQL, gRPC)
- Understand hardware constraints and system optimization
- Have Linux systems and containerization experience (Docker, Kubernetes)
- Are comfortable working with expensive resources where efficiency impacts costs
- Are excited to solve novel problems in AI infrastructure (beyond CRUD apps)
- Bring a startup mindset: comfortable with ambiguity and rapid iteration
Bonus qualifications
- Experience with GPU or HPC cluster management
- Familiarity with ML/AI workload patterns and requirements
- Experience with high-value resource allocation systems
- Background in performance optimization for compute-intensive workloads
- Knowledge of GPU virtualization and sharing technologies
- Experience building billing or metering systems
- Hybrid role: 3 days in office, 2 days WFH; must be located in Palo Alto
- Applicants must be authorized to work in the United States without visa sponsorship
Benefits & conditions
- They offer a competitive salary and equity based on experience and skill set