Site Reliability Engineer - UKIC (South)

SR2
Jacobstowe, United Kingdom

Role details

Contract type
Contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English

Job location

Jacobstowe, United Kingdom

Tech stack

Amazon Web Services (AWS)
Azure
Bash
Cloud Computing
Continuous Integration
Linux
DevOps
Disaster Recovery
Monitoring of Systems
Python
Live Connect (Windows)
Networking Basics
Release Management
Reliability Engineering
Prometheus
Datadog
Scripting (Bash/Python/Go/Ruby)
Grafana
Deployment Automation
Terraform
Splunk
Docker
Go

Job description

Site Reliability Engineer

Location: South of the UK/Hybrid
Clearance: Must hold active UKIC (South) clearance

The Role

We're supporting a requirement for a number of Site Reliability Engineers to join a high-assurance engineering environment delivering secure, resilient digital services within a sensitive UK government setting.

This role would suit someone with a strong background in production reliability, platform operations, automation, observability, and incident response, who is comfortable working across complex cloud-based or hybrid infrastructure. You'll play a key role in ensuring services are robust, scalable and supportable, while driving improvements in reliability, performance and operational maturity.

Responsibilities

  • Support the availability, performance and resilience of critical live services
  • Build and improve automation across operational processes, deployments and platform management
  • Design and maintain monitoring, alerting and observability tooling across services and infrastructure
  • Troubleshoot complex incidents, conduct root cause analysis, and implement preventative improvements
  • Work closely with engineering, platform and delivery teams to improve reliability and reduce operational risk
  • Contribute to capacity planning, service scaling, failover readiness and disaster recovery approaches
  • Help shape SRE best practice, including SLIs, SLOs, error budgets and operational standards
  • Support continuous improvement across CI/CD, release management and operational tooling

Experience Required

  • Strong experience in a Site Reliability Engineering, DevOps, or production support engineering role
  • Experience supporting business-critical live services in secure or complex environments
  • Strong understanding of Linux/Unix systems, networking fundamentals, and infrastructure troubleshooting
  • Experience with cloud platforms such as AWS, Azure or GCP
  • Hands-on experience with Infrastructure as Code, ideally Terraform or similar tooling
  • Experience with containers and orchestration, such as Docker and Kubernetes
  • Knowledge of CI/CD tooling and automated deployment pipelines
  • Strong experience with monitoring and observability tools such as Prometheus, Grafana, ELK, Datadog, Splunk or similar
  • Scripting or coding capability in tools/languages such as Python, Bash, Go or similar
  • Strong incident management, problem-solving and stakeholder communication skills
