Incident Manager (SRE / Operations)

RealTek Consulting
Philadelphia, United States of America
yesterday

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Philadelphia, United States of America

Tech stack

DevOps
Monitoring of Systems
Reliability Engineering
Grafana
Reliability of Systems

Job description

  • Lead incident command and management for critical production issues
  • Coordinate cross-functional teams during high-severity incidents
  • Drive root cause analysis (RCA) and implement preventive measures
  • Manage system reliability and operational stability
  • Collaborate with SRE, DevOps, and engineering teams
  • Ensure effective communication with stakeholders and leadership
  • Drive automation and observability improvements
  • Handle large-scale change events and system outages
  • Maintain incident reports, documentation, and post-mortem analysis
  • Continuously improve incident response processes and frameworks

Requirements

We are seeking experienced Incident Managers with strong expertise in SRE, operations engineering, and incident command. The ideal candidate will lead high-impact incident response, ensure system reliability, and drive cross-functional coordination during outages and large-scale system events.

  • 6-8 years of experience in Incident Management / Production Support / SRE roles
  • Strong expertise in:
      • Incident Command & Crisis Management
      • Site Reliability Engineering (SRE)
      • Operations Engineering
  • Strong knowledge of:
      • Reliability architecture and system design
      • Automation and observability tools
  • Proven ability to:
      • Lead teams during high-impact outages
      • Drive systemic problem resolution
  • Excellent executive communication and stakeholder management skills

  • Incident Management
  • SRE / Operations Engineering
  • Monitoring & Observability Tools
  • Automation & Reliability Engineering

  • Experience in enterprise-scale production environments
  • Strong analytical and problem-solving skills
  • Ability to work in high-pressure, fast-paced environments

  • Rapid and effective incident resolution
