Senior Site Reliability Champion

Vanguard

Saint Davids, United States of America

yesterday

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Saint Davids, United States of America

Tech stack

Failure Mode and Effects Analysis (FMEA)

Monitoring of Systems

Python

Reliability Engineering

UiPath

CloudWatch

Blue Prism

Splunk

AppDynamics

Dynatrace

Job description

Evaluate applications, platforms, and vendors to assess resiliency, reliability, and operational risk.
Design and implement processes that enforce enterprise resiliency and reliability standards.
Lead blameless post-incident reviews for high-severity incidents or incidents spanning multiple complex product families.
Partner with product and platform teams to proactively identify and remediate reliability risks before they impact clients.
Develop, communicate, and evangelize new standards, tools, and frameworks across subdivisions, ensuring consistent adoption.
Troubleshoot complex production issues and implement durable solutions that prevent recurrence.
Participate in a periodic on-call rotation to support production stability.
Evaluate and onboard resiliency and reliability tooling.
Actively participate in reliability engineering and resilience communities of practice, contributing to shared learning and enterprise consistency.
Contribute to strategic initiatives that advance Vanguard's operational maturity and resiliency posture.

Requirements

Observability Platforms: Experience with modern observability and monitoring tools, such as Splunk, Honeycomb, CloudWatch, Dynatrace, or AppDynamics.
Reliability Metrics: Strong understanding of SLIs, SLOs, and SLAs, including dashboarding and reporting practices.
Monitoring & Alerting: Experience with alert design, anomaly detection, predictive alerting, and synthetic monitoring using structured methodologies.
Automation & Resilience Engineering: Experience with automation and resilience practices such as Python-based automation, RPA platforms (e.g., Blue Prism, UiPath), chaos engineering, and failure analysis techniques (e.g., FMEA).

About the company

About Vanguard

At Vanguard, we don't just have a mission; we're on a mission. To work for the long-term financial wellbeing of our clients. To lead through products and services that transform our clients' lives. To learn and develop our skills as individuals and as a team. From Malvern to Melbourne, our mission drives us forward and inspires us to be our best.

How We Work

Vanguard has implemented a hybrid working model for the majority of our crew members, designed to capture the benefits of enhanced flexibility while enabling in-person learning, collaboration, and connection. We believe our mission-driven and highly collaborative culture is a critical enabler to support long-term client outcomes and enrich the employee experience.

Vanguard, one of the world's leading investment management companies, serves individual investors, institutions, employer-sponsored retirement plans, and financial professionals. We have a diverse and talented crew with a culture that promotes teamwork, along with an unwavering focus on serving our clients' best interests.
