Site Reliability Engineer - UKIC (South)
Job description
Site Reliability Engineer
Location: South of the UK/Hybrid
Clearance: Must hold active UKIC (South) clearance
The Role
We're supporting a requirement for a number of Site Reliability Engineers to join a high-assurance engineering environment delivering secure, resilient digital services within a sensitive UK government setting.
This role would suit someone with a strong background in production reliability, platform operations, automation, observability, and incident response, who is comfortable working across complex cloud-based or hybrid infrastructure. You'll play a key role in ensuring services are robust, scalable and supportable, while driving improvements in reliability, performance and operational maturity.
Responsibilities
- Support the availability, performance and resilience of critical live services
- Build and improve automation across operational processes, deployments and platform management
- Design and maintain monitoring, alerting and observability tooling across services and infrastructure
- Troubleshoot complex incidents, conduct root cause analysis, and implement preventative improvements
- Work closely with engineering, platform and delivery teams to improve reliability and reduce operational risk
- Contribute to capacity planning, service scaling, failover readiness and disaster recovery approaches
- Help shape SRE best practice, including SLIs, SLOs, error budgets and operational standards
- Support continuous improvement across CI/CD, release management and operational tooling
Experience Required
- Strong experience in a Site Reliability Engineering, DevOps, or production support engineering role
- Experience supporting business-critical live services in secure or complex environments
- Strong understanding of Linux/Unix systems, networking fundamentals, and infrastructure troubleshooting
- Experience with cloud platforms such as AWS, Azure or GCP
- Hands-on experience with Infrastructure as Code, ideally Terraform or similar tooling
- Experience with containers and orchestration, such as Docker and Kubernetes
- Knowledge of CI/CD tooling and automated deployment pipelines
- Strong experience with monitoring and observability tools such as Prometheus, Grafana, ELK, Datadog, Splunk or similar
- Scripting or coding capability in tools/languages such as Python, Bash, Go or similar
- Strong incident management, problem-solving and stakeholder communication skills