Site Reliability Engineer - UKIC (South)

SR2

Jacobstowe, United Kingdom

Role details

Contract type

Contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Job location

Jacobstowe, United Kingdom

Tech stack

Amazon Web Services (AWS)

Azure

Bash

Cloud Computing

Continuous Integration

Linux

DevOps

Disaster Recovery

Monitoring of Systems

Python

Live Connect (Windows)

Networking Basics

Release Management

Reliability Engineering

Prometheus

Datadog

Scripting (Bash/Python/Go/Ruby)

Grafana

Deployment Automation

Terraform

Splunk

Docker

Job description

Site Reliability Engineer

Location: South of the UK/Hybrid

Clearance: Must hold active UKIC (South) clearance

The Role

We're supporting a requirement for a number of Site Reliability Engineers to join a high-assurance engineering environment delivering secure, resilient digital services within a sensitive UK government setting.

This role would suit someone with a strong background in production reliability, platform operations, automation, observability, and incident response, who is comfortable working across complex cloud-based or hybrid infrastructure. You'll play a key role in ensuring services are robust, scalable and supportable, while driving improvements in reliability, performance and operational maturity.

Responsibilities

Support the availability, performance and resilience of critical live services
Build and improve automation across operational processes, deployments and platform management
Design and maintain monitoring, alerting and observability tooling across services and infrastructure
Troubleshoot complex incidents, conduct root cause analysis, and implement preventative improvements
Work closely with engineering, platform and delivery teams to improve reliability and reduce operational risk
Contribute to capacity planning, service scaling, failover readiness and disaster recovery approaches
Help shape SRE best practice, including SLIs, SLOs, error budgets and operational standards
Support continuous improvement across CI/CD, release management and operational tooling

Experience Required

Strong experience in a Site Reliability Engineering, DevOps, or production support engineering role
Experience supporting business-critical live services in secure or complex environments
Strong understanding of Linux/Unix systems, networking fundamentals, and infrastructure troubleshooting
Experience with cloud platforms such as AWS, Azure or GCP
Hands-on experience with Infrastructure as Code, ideally Terraform or similar tooling
Experience with containers and orchestration, such as Docker and Kubernetes
Knowledge of CI/CD tooling and automated deployment pipelines
Strong experience with monitoring and observability tools such as Prometheus, Grafana, ELK, Datadog, Splunk or similar
Scripting or coding capability in tools/languages such as Python, Bash, Go or similar
Strong incident management, problem-solving and stakeholder communication skills
