SRE Senior Engineer
Job description
As an SRE Senior Engineer, you will:
- Drive the full service lifecycle, from design and deployment to operations and continuous improvement.
- Build and implement foundational SRE capabilities such as SLOs/SLIs, observability platforms, status pages, chaos engineering, toil reduction, automated deployments and intelligent runbooks.
- Champion reliability, automation, and resilience patterns ensuring fault tolerance and exceptional customer experience.
- Engineer and optimise infrastructure, monitoring, and AIOps systems using modern technologies and strong development skills.
- Lead enterprise-level triage during incidents, guide stakeholders, and drive deep post-incident problem management and RCA outcomes.
- Support production operations while contributing to transformation initiatives across on-premises and cloud platforms.
- Collaborate with product and engineering teams to define and maintain SLOs aligned to business outcomes.
- Leverage data engineering and analytics to derive insights, automate decision-making, and improve operational intelligence.
Requirements
Do you have experience in service-oriented architecture?
We are looking for Senior Software Engineers (SRE) who can apply strong software engineering fundamentals to reliably run and continuously improve our infrastructure and applications. As part of the SRE team, you will design, build, and automate capabilities that uplift operational excellence and customer experience. You will help drive our reliability agenda across infrastructure, applications, data, automation and AIOps, while shaping a culture where engineering and operations work seamlessly together.
We are seeking passionate engineers who demonstrate ownership, curiosity, and the ability to solve complex problems with a first-principles approach. Ideal candidates will bring:
- 4-5 years of SRE experience and 10+ years overall in large-scale enterprise environments (data center and cloud; Azure/AWS preferred).
- Deep technical expertise in one or more areas of full-stack development and production operations, and strong advocacy for open-source technologies.
- Experience with monolithic, SOA, microservices, and distributed systems architectures; exposure to transformation programs is a plus.
- Hands-on experience building enterprise observability, integrating metrics, logs, traces, alerts and automation pipelines.
- Strong coding skills in one or more modern languages (e.g., Python, Go, Java, C#, Node.js).
- A strong foundation in performance engineering, scalability, debugging complex production issues and automation at scale.
- Excellent communication skills with the ability to simplify complex technical concepts for diverse audiences.
What Success Looks Like
Success in this role means you are elevating reliability engineering across platforms by maturing observability, improving performance, and reducing toil in meaningful ways. You are shaping stronger operational behaviours within teams, driving automation-first practices, and influencing how services are designed, deployed, and operated. You lead complex incident resolution, create clarity during ambiguity, and ensure post-incident actions translate into lasting improvements. Your work enables product and platform teams to deliver faster, operate with confidence, and consistently meet the reliability expectations of our customers.