Staff SRE, Ads

Reddit Inc.
Redruth, United Kingdom
2 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Remote
Redruth, United Kingdom

Tech stack

Big Data
Google BigQuery
Cloud Computing
Cloud Engineering
Computer Programming
Linux
Distributed Systems
Monitoring of Systems
Python
Machine Learning
Performance Tuning
Recommender Systems
Reliability Engineering
Data Logging
Spark
Build Management
Kubernetes
Apache Flink
Kafka
Vertica

Job description

  • Lead reliability initiatives across multiple Ads domains including ad serving, auctions, targeting, reporting, measurement, and billing.
  • Partner with engineering leadership to improve reliability, scalability, operational excellence, and engineering efficiency across the Ads organization.
  • Drive architecture reviews and influence technical decisions impacting critical revenue-generating systems.
  • Design and build platforms, tooling, and automation that improve reliability and developer productivity at scale.
  • Participate in on-call rotations, lead complex incident investigations, and coordinate cross-functional response efforts during major production events.
  • Identify systemic reliability risks and drive long-term solutions that improve platform resilience.
  • Establish reliability metrics around advertiser-critical user journeys such as campaign creation, ad delivery, auction participation, reporting, attribution, and billing.
  • Mentor engineers and provide technical leadership across multiple teams.
  • Influence roadmap planning and ensure reliability considerations are incorporated into product and infrastructure investments.

Requirements

  • 8+ years of experience in Site Reliability Engineering, Infrastructure Engineering, or related roles operating large-scale distributed systems.
  • Strong experience supporting high-traffic, user-facing production environments.
  • Deep understanding of distributed systems, networking, Linux systems, and cloud-native architectures.
  • Experience designing highly available systems with strong operational and reliability practices.
  • Strong understanding of observability systems including metrics, logging, tracing, and alerting.
  • Good programming skills in languages such as Go, Python, or similar.
  • Experience improving reliability through SLOs, automation, incident management, and performance optimization.
  • Demonstrated ability to troubleshoot complex issues across a modern distributed system stack.
  • Strong collaboration and communication skills with the ability to influence technical direction across teams.
  • Experience supporting advertising technology platforms or other large-scale revenue-critical systems.
  • Deep understanding of reliability challenges associated with ad-serving, real-time auctions, budget pacing, campaign delivery, measurement, attribution, or billing systems.
  • Experience operating high-QPS, low-latency services where latency directly impacts business outcomes.
  • Experience establishing reliability programs that deliver meaningful, measurable business outcomes.
  • Experience with Kubernetes, cloud infrastructure, and large-scale distributed systems.
  • Familiarity with Kafka, ClickHouse, Spark, Flink, BigQuery, or similar large-scale data platforms.
  • Experience partnering with Product, Data Science, and Ads Engineering organizations.
  • Experience supporting machine learning inference or recommendation systems at scale.

Benefits & conditions

  • Global Benefit programs that fit your lifestyle, from workspace to professional development to caregiving support
  • Family Planning Support
  • Gender-Affirming Care
  • Mental Health & Coaching Benefits
  • Group Personal Pension Scheme with Employer match
  • Private Medical and Dental Scheme
  • Income Replacement Programs
  • Bike to Work scheme
  • Flexible Vacation & Paid Volunteer Time Off
  • Generous Paid Parental Leave

About the company

Reddit is a community of communities. It's built on shared interests, passion, and trust, and is home to the most open and authentic conversations on the internet. Every day, Reddit users submit, vote, and comment on the topics they care most about. With 100,000+ active communities and approximately 126 million daily active unique visitors, Reddit is one of the internet's largest sources of information. For more information, visit www.redditinc.com.

Location: Reddit has a flexible first workforce. Don't live near our office? No worries: you can work remotely from anywhere in the UK, the Netherlands or Ireland.

The Ads organization powers Reddit's advertising platform, enabling advertisers to reach highly engaged communities while helping Reddit grow its business. The reliability of our Ads systems directly impacts advertiser success, revenue generation, and user experience. The Ads Reliability team partners closely with Ads Engineering teams to improve reliability, scalability, operational excellence, and developer productivity across Reddit's advertising ecosystem.

We're looking for a Staff Site Reliability Engineer who will provide technical leadership for reliability initiatives across the Ads organization and help shape the future of Ads infrastructure at Reddit.