Site Reliability Engineer
Job description
- Design and develop dashboards to monitor application health, performance, and key business metrics. Apply hands-on technical expertise to implement SLAs/SLOs/SLIs for a range of microservices and data pipelines.
- Support systems that serve millions of customers and billions of requests monthly, ensuring their availability, scalability and resiliency.
- Act as a key technical individual contributor within PEC, liaising with SRE guilds and driving improvements to our cloud deployments, monitoring solutions and CI/CD pipelines, and optimising cost.
- Automate monitoring, alerting, and reporting to improve system observability and reduce manual effort.
- Collaborate with engineering, operations, and business teams to ensure platform stability and proactive issue resolution.
- Analyse performance trends and provide insights to drive continuous improvement.
- Support capacity planning, disaster recovery, and compliance activities. Implement tooling that allows the business to triage incidents more efficiently, with more granular alerting, well-defined runbooks and auto-resolving mechanisms.
Requirements
The ideal candidate will have a strong background in one or more fields, including SRE and software engineering. In addition, the candidate will have experience supporting high-throughput applications at scale, building dashboards and driving Site Reliability Engineering (SRE) practices to keep our complex hybrid-cloud solutions resilient and efficient. An engineering mindset and experience working with large, complex organisations are preferable.
- Production experience with k8s and monitoring tools such as Datadog or Dynatrace.
- Proven experience of running post-mortems, defining SLAs/SLIs/SLOs and participating in support rotas.
- Extensive experience of cloud-native solutions (ideally Google Cloud).
- Proven experience and knowledge of automation, CI/CD and related best practices.
And any experience of these would be really useful:
- Familiarity with Pega CDH or similar decisioning platforms.
- Coding/scripting experience developed in a commercial/industry setting (Python/Bash).
- Proficiency with Kubernetes (ideally microservice architectures using the Istio service mesh).
Benefits & conditions
We're on an exciting journey to transform our Group and the way we're shaping finance for good. We're focusing on the future, investing in our technologies, workplaces, and colleagues to make our Group a great place for everyone. Including you.
Our focus is to ensure we're inclusive every day, building an organisation that reflects modern society and celebrates diversity in all its forms.
We also offer a wide-ranging benefits package, which includes:
- A generous pension contribution of up to 15%
- An annual performance-related bonus
- Share schemes including free shares
- Benefits you can adapt to your lifestyle, such as discounted shopping
- 30 days' holiday, with bank holidays on top
- A range of wellbeing initiatives and generous parental leave policies