Site Reliability Engineer
Job description
As a Site Reliability Engineer, you'll help shape and strengthen Evri's cloud infrastructure, reliability engineering practices and operational excellence. You'll work hands-on across AWS, container orchestration, observability platforms, and CI/CD ecosystems to ensure our systems are resilient, secure and optimised for scale.
Your work will be critical in enabling product teams to ship fast, safely and with confidence - all while improving performance, reducing risk, and ensuring Evri remains one of the most reliable logistics platforms in the UK.
- Drive architectural and technical decision-making, ensuring infrastructure and platform designs support long-term scalability, reliability and security.
- Partner with Delivery to plan and prioritise platform and infrastructure work for maximum technical and operational impact.
- Mentor engineers and uplift technical capability, championing strong engineering practices and continuous improvement.
- Shape technical strategy by contributing to architectural roadmaps, standards, and patterns, balancing innovation with long-term risk and resilience.
- Embed quality, security, performance and compliance into all engineering designs, processes and operational workflows, ensuring reliability at scale.
Requirements
- 5+ years' experience in a DevOps or SRE role, ideally within AWS-based environments.
- Strong proficiency with AWS CDK and Infrastructure as Code to deploy and optimise cloud infrastructure.
- Hands-on experience with Docker and container orchestration such as Kubernetes (EKS) or Amazon ECS.
- Proven experience building and maintaining CI/CD pipelines using GitLab, Jenkins or similar tooling.
- Deep knowledge of monitoring, observability and logging tools such as Prometheus, Grafana, AppDynamics and OpenSearch.
- Proficiency in Python, TypeScript or Java for building automation, tooling and reliability improvements.
- Solid understanding of cloud security, including WAF, patching, vulnerability management and AWS Shield.
- Working knowledge of message queues and streaming technologies such as RabbitMQ, Kinesis or Kafka.
- Strong analytical and operational problem-solving skills, with the ability to identify performance constraints, eliminate single points of failure and scale distributed systems.
- Experience participating in incident response, including root-cause analysis and driving long-term reliability improvements.
- Excellent communication and collaboration skills to work effectively across architecture, delivery, and engineering teams.