Senior Engineering Manager, Infrastructure
Job description
OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. The company is seeking an experienced engineering leader to scale its Cloud Infrastructure Automation function, which is responsible for Kubernetes operations, global networking, and infrastructure automation. The role involves building a high-performing team, setting infrastructure KPIs, and fostering a culture of inclusion and collaboration.
Responsibilities:
Build and lead a high-performing, diverse engineering team (including ICs and managers) focused on automation, Kubernetes, networking, and global traffic routing.
Set and own core infrastructure KPIs/SLOs (availability, latency, scalability) and continuously improve them.
Architect and scale multi-region Kubernetes clusters, global load balancing, and service mesh deployments.
Automate everything - from provisioning to upgrades to failover - using Terraform, CI/CD, and custom tooling.
Strengthen reliability and security, including network hardening, traffic shaping, and proactive incident prevention.
Partner with research and product teams to streamline the interface between experimental workloads and production deployments.
Coach and grow engineers and managers, fostering technical excellence and leadership depth.
Requirements
Required:
5+ years in engineering leadership, including managing managers and distributed teams.
Deep experience in Kubernetes at scale, networking (L4/L7), service mesh (e.g., Istio, Envoy), and cloud-native automation.
Experience designing and operating global load balancing and traffic routing at internet scale.
Fluent in infrastructure-as-code, modern CI/CD, and large-scale system observability.
Ability to dive deep technically while also operating at the strategic, organizational level.
Track record of recruiting and retaining top engineering talent in a competitive market.
Excel in environments with high ambiguity and rapid change, bringing clarity and execution focus.
Care deeply about security, reliability, and operational excellence as first-class priorities.