Staff Backend Engineer - Adaptive Telemetry | UK | Remote
Job description
Grafana Cloud is our composable observability platform that integrates metrics, logs, traces, and profiles with Grafana. It allows our customers to leverage the best open source observability software - including Prometheus, Mimir, Loki, Tempo, and Pyroscope - without the overhead of installing, maintaining and scaling their own observability stack.
The Databases department owns and operates our telemetry databases: Mimir for metrics, Loki for logs, Tempo for traces, and Pyroscope for profiles. We offer these databases as a cloud service supporting Grafana Cloud.
Adaptive Telemetry Group
The Adaptive Telemetry group, part of the Databases department, has the mission of ensuring that all telemetry stored in our databases is worthy of attention. Under that mission, the group is responsible for the development of Adaptive Metrics, Adaptive Logs, Adaptive Traces and Adaptive Profiles.
Our Adaptive Telemetry solutions give users the ability to control and optimize their telemetry data. These solutions ensure that data storage is optimized based on individual usage patterns, so only the most valuable data is retained.
As a company we are remote-first and global. We embrace people of different experiences and backgrounds to build diverse teams where every person brings a new perspective to the software.
What you will be doing:
- Drive technical strategy and roadmap. Proactively define the architectural vision, prioritize work that unlocks major product or platform improvements, and influence product and engineering decisions.
- Lead end-to-end delivery of large, cross-functional projects. Own planning, design, execution, rollout and long-term operation of large initiatives.
- Own architecture, reliability, performance and cost for critical systems. Make pragmatic architecture choices that balance scalability, availability, latency and cost while ensuring systems remain maintainable and evolvable.
- Define SLOs/SLIs and lead incident response. Establish measurable reliability targets, run high-severity incident response, lead blameless post-mortems, and drive systemic fixes and automation to prevent recurrence.
- Improve observability, automation and operational readiness. Champion telemetry, alerting, runbooks, capacity planning and automation efforts that reduce toil, speed debugging and lower MTTR.
- Align stakeholders and remove blockers. Coordinate across Product, Design and other teams to align priorities, negotiate tradeoffs, and unblock delivery for large initiatives.
- Mentor and grow engineering talent. Coach senior and mid-level engineers, lead design reviews, raise engineering standards, and help teammates make sound technical tradeoffs.
- Represent engineering internally and externally. Communicate technical strategy clearly to non-engineering stakeholders and represent the team in cross-team planning.
We invest heavily in developer productivity. You can use modern AI coding assistants as part of your daily workflow (your choice of tools, within security guidelines), backed by a company-funded usage budget so you can iterate quickly without unnecessary friction.
We encourage pragmatic AI-assisted development: faster prototyping, test generation, refactors, documentation, and incident follow-ups, always paired with strong code review and quality standards.
You'll also have access to frontier models (e.g., GPT-Codex 5/3, Claude Opus 4.6, Gemini 3 Pro).
Requirements
You are a motivated self-starter with a bias towards action. You are customer-focused: we build everything with our users in mind, and you have a passion for creating intuitive products that fit customers' needs.
- Proven delivery of large distributed systems. Experience shipping and operating complex systems that span multiple teams, with clear evidence of technical leadership and impact.
- Strong systems-design instincts. Deep understanding of tradeoffs around latency, consistency, availability, scaling and cost.
- Hands-on cloud and platform experience. Solid experience with cloud-native architectures (microservices, containers/Kubernetes, IaC) and the operational practices that keep them healthy.
- Reliability and performance ownership. Comfortable defining SLOs/SLIs, doing capacity planning, tuning performance, and driving reliability work end-to-end.
- Excellent coding and design skills. You write clear, maintainable, well-tested code and can lead technical designs - we use Go, but Python/C/C++/Rust or similar translate well.
- Comfort with AI-assisted development. We embrace AI and agentic development, so we expect you to be curious about and comfortable with AI-powered developer tools, ideally with practical experience folding them into a team's workflow.
- Experience with messaging and telemetry. Familiarity with streaming/messaging systems (e.g., Kafka) and observability tooling (Prometheus/Grafana or equivalents).
- Influence without authority. Ability to align cross-functional stakeholders, set priorities and drive outcomes in a remote-first environment.
- Strong communicator. Clear written and verbal communication that works across engineers and non-technical stakeholders.
Benefits & conditions
In the UK, the base compensation range for this role is £100,000 - £121,000. Actual compensation may vary based on level, experience, and skillset as assessed in the interview process. Benefits include equity, bonus (if applicable) and other benefits listed here.