Observability Platform Engineer (SRE Focus)
Job description
We're building a world-class Observability function, and we're looking for someone who lives for uptime, meaningful alerts, and elegant dashboards. If you've ever been on-call, silenced a noisy monitor, or traced a ghost bug across microservices outside core hours - we want to hear from you!
This isn't a generic "Platform Engineer" role. You'll be laser-focused on observability, reliability, and developer empowerment, working closely with teams to make sure we know not just when things break, but why.
What you'll be doing:
- Designing and scaling on-call systems that engineers don't dread being part of.
- Building out Datadog monitoring, alerting, dashboards, and log pipelines for our Kubernetes-based environments.
- Defining and managing SLOs, SLIs, and error budgets - and helping teams stick to them.
- Creating scorecards and software catalogs so engineers know what's healthy, what's broken, and who owns what.
- Training and enabling dev teams to own their own observability, alerts, and incident response.
- Introducing chaos engineering practices (yes, we want to break things… on purpose).
- Driving a culture of reliability, with incident reviews, shared learnings, and transparency.
Requirements
Do you have experience in Terraform?
Do you have a Bachelor's degree?
- Have production experience with observability tools (especially Datadog) in cloud-native environments.
- Have set up monitoring and alerting across Kubernetes services.
- Have built or scaled on-call systems in startups or large-scale environments.
- Know how to reduce alert fatigue and love a good MTTR chart.
- Have experience with infrastructure as code (Terraform preferred).
- Believe that great developer experience includes clear visibility and ownership.
- Are curious about - or already practicing - chaos engineering.
- Have knowledge of our stack: AWS (EKS, Lambda, etc.), Datadog, OpenTelemetry, Terraform, Kubernetes (EKS), Fluent Bit, FireLens, and Backstage (or a custom service catalog).
Desirable:
- Experience with OpenTelemetry, Fluent Bit, or similar.
- Familiarity with service catalog tooling (e.g., Backstage).
- Comfort running or facilitating game days or failure drills.
- Prior involvement in setting up scorecards for service health.
Benefits & conditions
- A high-quality team that pushes each other to succeed through direct feedback and aligned incentives.
- Strong and transparent team culture: we have each other's backs.
- Independent work environment where results matter.
- Data-driven culture and emphasis on speed (anti-red tape).
We offer a comprehensive benefits package that includes:
- Stock Options
- Private Medical insurance via Vitality and Dental Insurance with BUPA
- EAP with Health Assured
- Enhanced Maternity and Paternity Leave
- Modern and sophisticated office space in Central London
- Free Gym in office building in Holborn
- Subsidised Lunch via Feedr
- Deliveroo Allowance if working late in the office
- Monthly in-office Masseuse
- Team and Company Socials
- Football Power League / Paddle and Yoga Club