Staff Site Reliability Engineer
Job description
We are hiring a Staff Site Reliability Engineer (SRE) on a 12-month fixed-term contract to act as a senior individual contributor who will accelerate the launch of RWS's new Reliability Engineering function.
This role is a deep technical leadership position, focused on solving systemic reliability challenges, establishing engineering standards, and shaping the architecture and practices needed to improve reliability and observability across our entire estate.
You will work across product engineering, infrastructure operations, platform engineering, enterprise tech, and data teams, leading through expertise, influence, clarity of thinking, and hands-on delivery.
About Product & Technology
Product & Technology plays a pivotal role in aligning the organization with its strategic objectives and enhancing shareholder value. Product & Technology is responsible for establishing unified standards and governance practices throughout the company. Additionally, we oversee the development and maintenance of core applications essential for the seamless operation of various functions across the organization. We are committed to driving and executing future roadmaps that are in line with the overall strategic direction of RWS.
With a global reach, Product & Technology provides support services to over 7500 end users worldwide. We take pride in managing the information security operation and safeguarding all our assets. Our core functions encompass Enterprise & Technical Architecture, Network & Voice, Infrastructure, Service Delivery, Service Operations, Data & Analytics, Security & Quality Compliance, Transformation, Application Development, and Enterprise Platforms. With a dedicated team of over 500 staff, Product & Technology ensures a strong presence across all regions, enabling efficient and effective support to our global operations.
Technical Reliability Leadership
- Lead technical investigations into reliability, availability, latency, and performance issues across cloud and on-prem systems.
- Define and drive adoption of SLIs, SLOs, and error budgets; establish reliability baselines and engineering standards.
- Provide senior SRE guidance in incident reviews, deep-dives, and long-term remediation, ensuring real root causes are addressed.
- Act as a trusted technical partner to engineering teams, helping them design and operate more resilient systems.
Observability Platform Foundation
- Shape the architecture of RWS's new observability platform through hands-on design, prototyping, and technical decision input.
- Lead the definition of instrumentation standards (metrics, logs, traces, events) using modern, vendor-neutral approaches such as OpenTelemetry.
- Collaborate with platform engineering on tool consolidation, scalable telemetry pipelines, and improved signal quality.
- Build reusable frameworks and components that improve visibility and operational excellence across teams.
Engineering Excellence & Automation
- Identify systemic reliability bottlenecks and design technical solutions such as automation, re-architecture, pipelines, and guardrails.
- Improve deployment stability, operational readiness, and the quality of services across multiple product lines.
- Introduce resilience techniques such as load testing, chaos testing, and failure-mode analysis where appropriate.
- Produce clear technical documentation, runbooks, and patterns that raise engineering maturity.
Cross-Functional Technical Collaboration
- Work closely with engineering teams across RWS's diverse product and platform landscape to embed SRE thinking.
- Partner with Infrastructure Operations to shape incident detection, response automation, dashboards, and alerting improvements.
- Collaborate with Product, Security, Data, and Enterprise Tech to address cross-cutting reliability work.
- Help define how the future Reliability & Operations organisation should operate from a technical practice perspective.
Requirements
You are a senior-level SRE or platform engineer with strong architectural instincts, deep operational experience, and the ability to lead complex technical improvement initiatives without formal authority.
You will have:
- Substantial hands-on SRE/operational engineering experience across distributed systems.
- Strong expertise in observability (metrics, logging, tracing) and platforms such as Prometheus, Grafana, ELK, Datadog, Splunk, Honeycomb, or equivalents.
- Experience with OpenTelemetry and modern telemetry pipelines.
- Deep knowledge of AWS and/or GCP, Kubernetes/EKS, Linux systems, and CI/CD tooling.
- Ability to analyse complex system behaviour, diagnose issues, and design scalable, pragmatic solutions.
- Strong technical communication skills, able to influence through clarity, evidence, and thoughtful design.