Product Owner - Operational Resilience

TEKsystems

Sheffield, United Kingdom

2 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Job location

Sheffield, United Kingdom

Tech stack

Disaster Recovery

Fault Tolerance

Scrum

Systems Development Life Cycle

Reliability Engineering

Performance Testing

System Availability

Extreme Programming (XP)

Job description

Own and evolve a Proactive Resilience product/capability that anticipates, prevents, and mitigates technology and service disruption. You'll translate resilience outcomes (availability, recoverability, performance, operational readiness) into a clear product roadmap, measurable value, and repeatable adoption across platforms and teams.

Product strategy & roadmap

Define product vision, target users and a prioritised roadmap aligned to business services.
Maintain a clear backlog of resilience features.

Outcome-driven delivery

Set OKRs/KPIs for proactive resilience.
Maintain a Community of Practice to surface potential resilience improvements, tracked and prioritised via a backlog.

Resilience-by-design

Embed resilience enhancements into SDLC and change processes (non-functional requirements, release readiness, operational acceptance).
Champion practices such as chaos engineering, game days, fault injection, capacity and performance testing, and DR readiness.

Observability & insights

Partner with monitoring/observability teams to improve telemetry, alert quality, and actionable dashboards.
Use data to identify systemic risks, recurring failure modes, and top offenders across services.

Automation & operational excellence

Prioritise automation for detection, triage, and remediation.

Stakeholder management

Align engineering, operations, architecture, risk, and business stakeholders on resilience priorities.
Communicate progress and risk clearly to senior leadership; manage dependencies and delivery risks.

Governance & controls

Ensure the product supports relevant operational resilience expectations (e.g. impact tolerances, testing evidence, auditability).
Maintain documentation, controls evidence, and reporting suitable for risk and assurance audiences.

Required experience & skills

Product ownership/management experience in platform, SRE or operational resilience domains.

Requirements

Operational Resilience

SRE principles (SLO/SLI), incident/problem management, and service management.
Resilience patterns (redundancy, graceful degradation).
DR/BCP concepts (RTO/RPO), high availability, and dependency management.

Data-driven decision-making: ability to use incident, change, and telemetry data to prioritise.

Agile delivery expertise (Scrum/Kanban), backlog management, and stakeholder communication.

Desirable

Familiarity with resilience patterns and platform engineering.

Experience running game days/chaos experiments and translating findings into engineering work.

Financial services experience and comfort working with risk, compliance, and audit partners.

Product Ownership

Product Management
Operational Resilience
Technology
Disaster Recovery
Resilience
Proactive Resilience
Product Roadmapping
SRE Principles
SLO
SLI
Incident Management
Problem Management
Service Management
DR
BCP
RTO
RPO
Dependency Management
