Product Owner - Operational Resilience
Job description
Own and evolve a Proactive Resilience product/capability that anticipates, prevents, and mitigates technology and service disruption. You'll translate resilience outcomes (availability, recoverability, performance, operational readiness) into a clear product roadmap, measurable value, and repeatable adoption across platforms and teams.
Product strategy & roadmap
- Define product vision, target users, and a prioritised roadmap aligned to business services.
- Maintain a clear backlog of resilience features.
Outcome-driven delivery
- Set OKRs/KPIs for proactive resilience.
- Maintain a Community of Practice to surface potential resilience improvements, which are tracked and prioritised via a backlog.
Resilience-by-design
- Embed resilience enhancements into SDLC and change processes (non-functional requirements, release readiness, operational acceptance).
- Champion practices such as chaos engineering, game days, fault injection, capacity and performance testing, and DR readiness.
Observability & insights
- Partner with monitoring/observability teams to improve telemetry, alert quality, and actionable dashboards.
- Use data to identify systemic risks, recurring failure modes, and top offenders across services.
Automation & operational excellence
- Prioritise automation for detection, triage, and remediation.
Stakeholder management
- Align engineering, operations, architecture, risk, and business stakeholders on resilience priorities.
- Communicate progress and risk clearly to senior leadership; manage dependencies and delivery risks.
Governance & controls
- Ensure the product supports relevant operational resilience expectations (e.g., impact tolerances, testing evidence, auditability).
- Maintain documentation, controls evidence, and reporting suitable for risk and assurance audiences.
Required experience & skills
Product ownership/management experience in platform, SRE, or operational resilience domains.
Requirements
Operational Resilience
- SRE principles (SLO/SLI), incident/problem management, and service management.
- Resilience patterns (redundancy, graceful degradation).
- DR/BCP concepts (RTO/RPO), high availability, and dependency management.
Data-driven decision-making: ability to use incident, change, and telemetry data to prioritise.
Agile delivery expertise (Scrum/Kanban), backlog management, and stakeholder communication.
Desirable
Familiarity with resilience patterns and platform engineering.
Experience running game days/chaos experiments and translating findings into engineering work.
Financial services experience and comfort working with risk, compliance, and audit partners.
- Product Ownership
- Product Management
- Operational Resilience
- Technology
- Disaster Recovery
- Resilience
- Proactive Resilience
- Product Roadmapping
- SRE Principles
- SLO
- SLI
- Incident Management
- Problem Management
- Service Management
- DR
- BCP
- RTO
- RPO
- Dependency Management