Product Owner - Operational Resilience

TEKsystems
Sheffield, United Kingdom
2 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English

Job location

Sheffield, United Kingdom

Tech stack

Disaster Recovery
Fault Tolerance
Scrum
Systems Development Life Cycle
Reliability Engineering
Performance Testing
System Availability
Extreme Programming (XP)

Job description

Own and evolve a Proactive Resilience product/capability that anticipates, prevents, and mitigates technology and service disruption. You'll translate resilience outcomes (availability, recoverability, performance, operational readiness) into a clear product roadmap, measurable value, and repeatable adoption across platforms and teams.

Product strategy & roadmap

  • Define product vision, target users and a prioritised roadmap aligned to business services.

  • Maintain a clear backlog of resilience features.

Outcome-driven delivery

  • Set OKRs/KPIs for proactive resilience.

  • Run a Community of Practice to surface potential resilience improvements, which are tracked and prioritised via a backlog.

Resilience-by-design

  • Embed resilience enhancements into SDLC and change processes (non-functional requirements, release readiness, operational acceptance).

  • Champion practices such as chaos engineering, game days, fault injection, capacity and performance testing, and DR readiness.

Observability & insights

  • Partner with monitoring/observability teams to improve telemetry, alert quality, and actionable dashboards.

  • Use data to identify systemic risks, recurring failure modes, and top offenders across services.

Automation & operational excellence

  • Prioritise automation for detection, triage, and remediation.

Stakeholder management

  • Align engineering, operations, architecture, risk, and business stakeholders on resilience priorities.

  • Communicate progress and risk clearly to senior leadership; manage dependencies and delivery risks.

Governance & controls

  • Ensure the product supports relevant operational resilience expectations (e.g. impact tolerances, testing evidence, auditability).

  • Maintain documentation, controls evidence, and reporting suitable for risk and assurance audiences.

Required experience & skills

Product ownership/management experience in platform, SRE or operational resilience domains.

Requirements

Operational Resilience

  • SRE principles (SLO/SLI), incident/problem management, and service management.

  • Resilience patterns (redundancy, graceful degradation).

  • DR/BCP concepts (RTO/RPO), high availability, and dependency management.

Data-driven decision-making: ability to use incident, change, and telemetry data to prioritise.

Agile delivery expertise (Scrum/Kanban), backlog management, and stakeholder communication.

Desirable

Familiarity with resilience patterns and platform engineering.

Experience running game days/chaos experiments and translating findings into engineering work.

Financial services experience and comfort working with risk, compliance, and audit partners.

  • Product Ownership
  • Product Management
  • Operational Resilience
  • Technology
  • Disaster Recovery
  • Resilience
  • Proactive Resilience
  • Product Roadmapping
  • SRE Principles
  • SLO
  • SLI
  • Incident Management
  • Problem Management
  • Service Management
  • DR
  • BCP
  • RTO
  • RPO
  • Dependency Management
