Senior Site Reliability Champion
Job description
- Evaluate applications, platforms, and vendors to assess resiliency, reliability, and operational risk.
- Design and implement processes that enforce enterprise resiliency and reliability standards.
- Lead blameless post-incident reviews for high-severity incidents or incidents spanning multiple complex product families.
- Partner with product and platform teams to proactively identify and remediate reliability risks before they impact clients.
- Develop, communicate, and evangelize new standards, tools, and frameworks across subdivisions, ensuring consistent adoption.
- Troubleshoot complex production issues and implement durable solutions that prevent recurrence.
- Participate in a periodic on-call rotation to support production stability.
- Evaluate and onboard resiliency and reliability tooling.
- Actively participate in reliability engineering and resilience communities of practice, contributing to shared learning and enterprise consistency.
- Contribute to strategic initiatives that advance Vanguard's operational maturity and resiliency posture.
Requirements
- Observability Platforms: Experience with modern observability and monitoring tools, such as Splunk, Honeycomb, CloudWatch, Dynatrace, or AppDynamics.
- Reliability Metrics: Strong understanding of SLIs, SLOs, and SLAs, including dashboarding and reporting practices.
- Monitoring & Alerting: Experience with alert design, anomaly detection, predictive alerting, and synthetic monitoring using structured methodologies.
- Automation & Resilience Engineering: Experience with Python-based automation, RPA platforms (e.g., Blue Prism, UiPath), chaos engineering, and failure analysis techniques (e.g., FMEA).