Site Reliability Engineer

Outsource UK

Manor Park, United Kingdom

2 days ago

Role details

Contract type

Temporary contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Manor Park, United Kingdom

Tech stack

Java

Artificial Intelligence

Amazon Web Services (AWS)

Systems Engineering

Azure

Computer Programming

Distributed Systems

Python

Reliability Engineering

Prometheus

Datadog

Data Logging

Grafana

CloudFormation

Kubernetes

Information Technology

Terraform

Job description

We are seeking a Senior Site Reliability Engineer to drive the reliability, scalability, and operational excellence of Network commerce systems, including subscriptions and personalization services. You will collaborate closely with product and engineering teams to enhance system architecture, deployment safety, observability, and overall performance.

Reliability & Risk Engineering

Identify systemic reliability risks and implement preventative solutions.
Define and maintain SLIs, SLOs, and error budgets aligned with business and user outcomes.
Lead incident management, post-incident reviews, and remediation planning.

Architecture & Resilience

Review and advise on system architecture to improve scalability, availability, and fault isolation.
Design strategies for high availability, graceful degradation, and disaster recovery across multi-region environments.
Quantify trade-offs between performance, cost, and operational risk.

CI/CD & Deployment Safety

Enhance deployment pipelines and implement automation to reduce risk and accelerate delivery.
Apply safe deployment patterns such as canary, blue/green, and progressive delivery.
Ensure robust rollback and recovery mechanisms.

Observability & Performance

Build and evolve monitoring, logging, and tracing solutions to provide actionable insights.
Collaborate to reduce alert fatigue and improve signal quality.
Diagnose performance bottlenecks across infrastructure and applications.

Infrastructure & Automation

Operate cloud-native and containerized workloads at scale.
Use Infrastructure as Code tools to deploy and manage resilient platforms.
Develop automation frameworks to reduce manual toil and operational risk.

Leadership & Mentorship

Mentor mid-level engineers and advocate SRE best practices across teams.
Partner with engineering, product, and security teams to embed reliability into system design.

Requirements

Bachelor's degree in Computer Science, Engineering, or equivalent experience.
7+ years in site reliability, production engineering, or systems engineering roles.
Strong understanding of distributed systems, consistency models, failure modes, and fault isolation strategies.
Hands-on experience with AWS, GCP, or Azure, including multi-region deployments.
Proficiency in Kubernetes and large-scale container orchestration.
Programming experience in Go, Python, or Java, building automation or reliability systems.
Experience designing and operating CI/CD pipelines with deployment safety guardrails.
Proven track record leading high-severity incidents and driving systemic remediation.
Excellent interpersonal skills with experience influencing cross-team decisions.
Experience with multi-cloud or multi-region resilience architecture.
Proficiency in monitoring and observability tools (Prometheus, Grafana, Datadog).
Prior mentorship or technical leadership experience.
Familiarity with Infrastructure as Code tools (Terraform, CloudFormation).
Experience using AI-assisted tools for incident analysis, operational efficiency, or observability.