Site Reliability Engineering Lead
Job description
We are seeking a Site Reliability Engineering Lead to head a team delivering mission-critical cloud services for a UK Public Sector client. This role combines hands-on technical expertise with leadership responsibilities, ensuring high availability, reliability, and scalability of cloud platforms. You will drive operational excellence, champion automation, and foster collaboration across cross-functional teams to deliver secure, resilient solutions.
Key Responsibilities
Team Leadership & Management
- Lead, manage, and mentor a team of CloudOps engineers, with responsibility for performance management, career development, and engagement.
- Manage the on-call rota and operational readiness for 24/7 support.
- Oversee administrative and resource planning tasks.
- Represent CloudOps at Programme Board, Architecture, Service Review, and client meetings where necessary.
Cloud Operations & Automation
- Design and implement Infrastructure-as-Code (IaC) solutions using tools such as Terraform and Ansible.
- Automate provisioning, configuration, and scaling of AWS cloud resources.
- Build and maintain CI/CD pipelines for infrastructure and application deployments.
Platform Reliability & Performance
- Monitor and troubleshoot cloud services to ensure uptime and rapid incident resolution.
- Optimise system performance through metrics, dashboards, and proactive tuning.
- Implement cost optimisation strategies for cloud resource usage.
Application Support
- Become familiar with the application and service in order to provide L2 support.
- Coordinate with Service Management, Engineering, and DevOps on application issues.
Operational Excellence
- Develop and maintain disaster recovery and backup strategies.
- Ensure compliance with security and governance standards, including handling sensitive data (PII/PHI).
- Maintain comprehensive documentation for infrastructure and operational processes.
Collaboration & Continuous Improvement
- Partner with QA, Product, and Development teams to enhance service reliability.
- Drive initiatives to improve time-to-market, quality, and resilience of solutions.
Requirements
- 10+ years of proven experience in CloudOps, DevOps, or SRE roles, with strong leadership capabilities.
- Skilled in cloud architecture (AWS preferred), Linux environments, and containerisation frameworks.
- Proficient in Python or a similar programming language.
- Hands-on experience with IaC tools (Terraform, Ansible) and CI/CD automation.
- Strong problem-solving skills and the ability to work in fast-paced, distributed teams.
- Eligible for a DBS check and UK Security Clearance.
Desirable:
- Experience supporting client-facing systems in public sector or healthcare.
- Familiarity with secure systems handling sensitive data.
- Proactive mindset for identifying operational improvements.