Site Reliability Engineer (SRE)
Job description
We're looking for a Site Reliability Engineer (SRE) to help shape and drive how we build and operate reliable, observable, and cost-efficient systems.
You'll work closely with development, platform, and incident management teams to define what "reliable" means in measurable terms, and to build the tooling and processes to achieve it.
Your work will directly influence the speed, stability, and scalability of our platform.
Key Responsibilities
Partner with development teams to define and manage SLOs/SLIs, and use error budgets to guide engineering decisions.
Enhance observability by ensuring metrics, logs, and traces are in place to detect and fix issues proactively.
Lead cost optimisation initiatives: monitor spend, rightsize workloads, tune autoscaling, and drive efficient infrastructure usage.
Strengthen production readiness with pre-deployment checks, post-release validation, and robust platform guardrails.
Introduce and run chaos engineering experiments to improve system resilience.
Automate operational processes to reduce manual intervention across the stack.
Contribute to major incident response, providing engineering expertise.
Collaborate cross-functionally to raise the bar on platform stability, security, and performance.
Requirements
3+ years in SRE, Platform, or DevOps roles.
Strong operational experience with Kubernetes (on-prem and AWS EKS).
Proven track record defining and working with SLOs/SLIs in production environments.
Deep understanding of observability (metrics, logging, tracing, telemetry