Enterprise Site Reliability Engineer
Job description
We are seeking an experienced Enterprise Site Reliability Engineer (SRE) to join a major global financial services client. This role is pivotal in establishing, scaling and maturing an SRE capability across the enterprise. The ideal candidate will have hands-on experience creating and implementing an SRE function from scratch within a banking or highly regulated financial environment, and will be able to embed modern reliability practices across engineering teams. This is a 12-month contract with the prospect of extension, and is a hybrid role based in London.
Responsibilities
- Designing, implementing and embedding an enterprise-wide SRE operating model, including SLIs, SLOs and error budgets.
- Establishing SRE processes, governance and tooling from the ground up for a large-scale banking environment.
- Leading platform reliability initiatives and partnering with engineering, architecture and operations teams.
- Automating infrastructure, deployments and operational workflows to reduce toil and improve service stability.
- Implementing enterprise observability standards including logging, metrics, distributed tracing and intelligent alerting.
- Conducting blameless incident reviews and driving systemic improvements across services and platforms.
- Supporting cloud strategy and ensuring reliability best practices across AWS/Azure/GCP environments.
- Leading resilience engineering initiatives including chaos testing, failover scenarios and DR validation.
- Producing high-quality governance, risk and resilience materials for senior stakeholders.
- Mentoring engineering teams to adopt SRE best practices and shift towards a reliability-first culture.
- Supporting audits, regulatory reviews and technical risk assessments.
- Analysing system performance trends, service gaps and opportunities for reliability improvement.
Requirements
- At least 7 years' experience in SRE, production engineering or large-scale platform engineering roles.
- Proven experience building an SRE function from scratch within a bank or regulated financial institution.
- Strong experience with cloud environments (AWS, Azure or GCP) and infrastructure-as-code tooling.
- Hands-on experience with Kubernetes, containers, service mesh technologies and CI/CD platforms.
- Expertise with observability tools such as Prometheus, Grafana, Splunk, Datadog or OpenTelemetry.
- Strong scripting or development skills in Python, Go, Bash or similar.
- Excellent communication and documentation skills, including senior stakeholder management.
- Ability to influence engineering teams and drive cultural change.
- Strong organisational skills and ability to manage multiple priorities.
- Experience in a banking or financial services organisation is essential.
- Knowledge of operational resilience frameworks, risk controls and audit requirements is desirable.