Senior Incident Manager
Job description
The Senior Incident Manager is responsible for leading the end-to-end lifecycle of operational incidents impacting AI infrastructure and data center services. This individual acts as the central command point during major incidents, ensuring rapid triage, cross-team coordination, effective communication, and structured post-incident analysis.
This role requires deep operational expertise in high-availability infrastructure, large-scale GPU clusters, networking, and cloud platforms, along with strong leadership and communication skills.
What You'll Do
Incident Leadership
- Lead the response to critical (SEV-1 / SEV-2) incidents impacting AI infrastructure, GPU clusters, networking, storage, and data center operations.
- Serve as the Incident Commander during major outages, coordinating engineering, networking, facilities, and vendor teams.
- Act as the liaison between leadership and external teams during and after incidents, providing updates and status summaries.
- Establish clear incident timelines, triage actions, and resolution plans.
Incident Management Operations
- Own the incident response lifecycle including:
- Assisting Technical Triage
- Escalation
- Coordination
- Resolution
- Post-incident review
- Ensure timely and accurate communication with internal stakeholders and leadership.
- Maintain incident response documentation and operational playbooks.
- Analyze incidents to identify patterns and trends that can improve incident response and system reliability.
- Participate in an on-call rotation to respond to, lead, and coordinate incidents.
Cross-Functional Coordination
- Work closely with:
- Data center operations
- Infrastructure engineering & operations
- Network engineering
- Platform reliability engineering
- Security operations
- Hardware and facility vendors
- Drive alignment during outages involving multiple infrastructure layers.
Post-Incident Analysis & Continuous Improvement
- Lead post-incident reviews (PIRs) and root cause analysis.
- Identify systemic reliability gaps and implement corrective actions.
- Track incident metrics including MTTR, MTTD, and incident recurrence rates.
Operational Excellence
- Improve incident response processes, escalation paths, and tooling by working with technical support and engineering teams.
- Contribute to runbooks, operational standards, and reliability frameworks.
- Support implementation of automation and observability improvements.
Communication & Reporting
- Provide executive-level incident summaries and reports.
- Deliver clear, concise updates during active incidents.
- Maintain incident dashboards and operational health reporting.

- Reduced Mean Time to Resolution (MTTR) for critical incidents
- Improved cross-team incident coordination
- High-quality post-incident reviews and corrective actions
- Increased infrastructure reliability and operational maturity
Requirements
- 8+ years of experience in incident management, site reliability engineering, or infrastructure operations
- Experience managing incidents in large-scale distributed infrastructure environments
- Strong understanding of:
- Data center operations
- GPU compute clusters
- Networking and storage infrastructure
- Cloud or hybrid infrastructure platforms
- Proven ability to lead high-pressure incident response situations
- Experience with incident management frameworks (ITIL, SRE, or equivalent)
- Excellent communication and stakeholder management skills
- Experience with incident tracking and monitoring tools such as:
- PagerDuty
- ServiceNow
- Jira
- Datadog
- Prometheus / Grafana
Nice to Have
- Experience operating AI or HPC infrastructure
- Background in SRE, infrastructure engineering, or data center operations
- Familiarity with high-density GPU environments (NVIDIA clusters, InfiniBand networks)
- Experience with hyperscale or colocation data center environments
- Knowledge of automation and incident response tooling
- Knowledge of and experience with the Incident Command System (ICS)
- Experience building and leading an incident command function from scratch
Key Competencies
- Incident Command & Leadership
- Operational Decision Making
- Cross-Team Coordination
- Root Cause Analysis
- Crisis Communication
- Infrastructure Reliability
About Lambda
- Founded in 2012, with 500+ employees, and growing fast
- Our investors notably include TWG Global, US Innovative Technology Fund (USIT), Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, Gradient Ventures, Mercato Partners, SVB, 1517, and Crescent Cove
- We have research papers accepted at top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
- Our values are publicly available: https://lambda.ai/careers
- We offer generous cash & equity compensation
- Health, dental, and vision coverage for you and your dependents
- Wellness and commuter stipends for select roles
- 401(k) plan with 2% company match (USA employees)
- Flexible paid time off plan that we all actually use