Incident Manager (SRE / Operations)
RealTek Consulting
Philadelphia, United States of America
yesterday
Role details
Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Experience level: Senior
Job location: Philadelphia, United States of America
Tech stack
DevOps
Monitoring of Systems
Reliability Engineering
Grafana
Reliability of Systems
Job description
- Lead incident command and management for critical production issues
- Coordinate cross-functional teams during high-severity incidents
- Drive root cause analysis (RCA) and implement preventive measures
- Manage system reliability and operational stability
- Collaborate with SRE, DevOps, and engineering teams
- Ensure effective communication with stakeholders and leadership
- Drive automation and observability improvements
- Handle large-scale change events and system outages
- Maintain incident reports, documentation, and post-mortem analysis
- Continuously improve incident response processes and frameworks
Requirements
We are seeking experienced Incident Managers with strong expertise in SRE, operations engineering, and incident command. The ideal candidate will lead high-impact incident response, ensure system reliability, and drive cross-functional coordination during outages and large-scale system events.
- 6-8 years of experience in:
  - Incident Management / Production Support / SRE roles
- Strong expertise in:
  - Incident Command & Crisis Management
  - Site Reliability Engineering (SRE)
  - Operations Engineering
- Strong knowledge of:
  - Reliability architecture and system design
  - Automation and observability tools
- Proven ability to:
  - Lead teams during high-impact outages
  - Drive systemic problem resolution
- Excellent executive communication and stakeholder management skills

- Incident Management
- SRE / Operations Engineering
- Monitoring & Observability Tools
- Automation & Reliability Engineering

- Experience in enterprise-scale production environments
- Strong analytical and problem-solving skills
- Ability to work in high-pressure, fast-paced environments

- Rapid and effective incident resolution