Site Reliability Engineer

Outsource UK
Manor Park, United Kingdom
2 days ago

Role details

Contract type
Temporary contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Manor Park, United Kingdom

Tech stack

Java
Artificial Intelligence
Amazon Web Services (AWS)
Systems Engineering
Azure
Computer Programming
Distributed Systems
Python
Reliability Engineering
Prometheus
Datadog
Data Logging
Grafana
CloudFormation
Kubernetes
Information Technology
Terraform

Job description

We are seeking a Senior Site Reliability Engineer to drive the reliability, scalability, and operational excellence of Network commerce systems, including subscriptions and personalization services. You will collaborate closely with product and engineering teams to enhance system architecture, deployment safety, observability, and overall performance.

Reliability & Risk Engineering

  • Identify systemic reliability risks and implement preventative solutions.
  • Define and maintain SLIs, SLOs, and error budgets aligned with business and user outcomes.
  • Lead incident management, post-incident reviews, and remediation planning.

Architecture & Resilience

  • Review and advise on system architecture to improve scalability, availability, and fault isolation.
  • Design strategies for high availability, graceful degradation, and disaster recovery across multi-region environments.
  • Quantify trade-offs between performance, cost, and operational risk.

CI/CD & Deployment Safety

  • Enhance deployment pipelines and implement automation to reduce risk and accelerate delivery.
  • Apply safe deployment patterns such as canary, blue/green, and progressive delivery.
  • Ensure robust rollback and recovery mechanisms.

Observability & Performance

  • Build and evolve monitoring, logging, and tracing solutions to provide actionable insights.
  • Collaborate to reduce alert fatigue and improve signal quality.
  • Diagnose performance bottlenecks across infrastructure and applications.

Infrastructure & Automation

  • Operate cloud-native and containerized workloads at scale.
  • Use Infrastructure as Code tools to deploy and manage resilient platforms.
  • Develop automation frameworks to reduce manual toil and operational risk.

Leadership & Mentorship

  • Mentor mid-level engineers and advocate SRE best practices across teams.
  • Partner with engineering, product, and security teams to embed reliability into system design.

Requirements

  • Bachelor's degree in Computer Science, Engineering, or equivalent experience.
  • 7+ years in site reliability, production engineering, or systems engineering roles.
  • Strong understanding of distributed systems, consistency models, failure modes, and fault isolation strategies.
  • Hands-on experience with AWS, GCP, or Azure, including multi-region deployments.
  • Proficiency in Kubernetes and large-scale container orchestration.
  • Programming experience in Go, Python, or Java, building automation or reliability systems.
  • Experience designing and operating CI/CD pipelines with deployment safety guardrails.
  • Proven track record leading high-severity incidents and driving systemic remediation.
  • Excellent interpersonal skills with experience influencing cross-team decisions.
  • Experience with multi-cloud or multi-region resilience architecture.
  • Proficiency in monitoring and observability tools (Prometheus, Grafana, Datadog).
  • Prior mentorship or technical leadership experience.
  • Familiarity with Infrastructure as Code tools (Terraform, CloudFormation).
  • Experience using AI-assisted tools for incident analysis, operational efficiency, or observability.