Staff Site Reliability Engineer

Obsidian Security

Manchester, United Kingdom

17 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Intermediate

Job location

Manchester, United Kingdom

Tech stack

Amazon Web Services (AWS)

Software as a Service

Continuous Integration

Software Debugging

DevOps

Distributed Systems

Reliability Engineering

Prometheus

Grafana

GitLab CI

Kubernetes

Data Pipelines

Legacy Systems

Job description

As a Staff SRE at Obsidian, you will define and drive the company-wide reliability vision for a complex, multi-tenant SaaS platform serving enterprise and financial customers. You will operate as a strategic partner to DevOps and Platform Engineering leadership, shaping a unified reliability strategy that scales across the organization.

Your core mandate: ensure Obsidian detects, diagnoses, and communicates system issues before customers are impacted, consistently and predictably.

This is a hands-on technical role that involves architecting and leading the implementation of systems that handle real-world complexity, including upstream SaaS dependencies, sparse and noisy signals, and mission-critical enterprise workloads.

Reliability Strategy & Architecture - Define and lead long-term reliability strategy across services. Establish end-to-end system visibility frameworks and guide architecture for observability, detection, and resilience.
Cross-Org Leadership - Partner across teams to embed reliability, standardize SLI/SLOs, and serve as a technical escalation expert.
Detection & Observability - Build intelligent detection systems (anomaly detection, connector health models) and enable self-service observability.
Incident Management - Define and evolve a tiered incident communication strategy, improve response practices, and lead postmortems to strengthen reliability and customer trust.
Execution - Contribute hands-on to system design, monitoring, and debugging across distributed systems and data pipelines.

Requirements

5+ years in SRE, Production Engineering, or related roles
3+ years operating at a senior or technical leadership level (Staff or equivalent scope)
Deep expertise in:

AWS and/or GCP
Kubernetes and Helm
Observability stacks (Prometheus, Grafana, or equivalent)
CI/CD systems (GitLab CI/CD, ArgoCD, etc.)

Proven experience designing and scaling reliability systems for multi-tenant SaaS platforms
Strong debugging and systems thinking across distributed microservices and legacy systems
Demonstrated ability to lead initiatives that improve incident detection, response, and system resilience
Hands-on engineering approach with a track record of building, not just configuring, reliability systems

Experience in B2B SaaS serving enterprise or financial customers
Familiarity with third-party SaaS connector architectures and ingestion patterns
Experience building anomaly detection or intelligent alerting systems
Experience designing customer-facing status pages and incident communication frameworks

Benefits & conditions

Why This Role

Drive org-wide reliability strategy
Own and build new detection & observability systems
Tackle complex distributed systems challenges
Safeguard critical infrastructure for financial customers

What Success Looks Like

Issues caught and resolved before customer impact
Reliability is measurable and continuously improving
Teams self-serve observability with scalable tools
Clear, proactive incident communication builds trust
Reliability becomes a competitive advantage

About the company

Founded in 2017, Obsidian Security was created to close a critical gap: securing the SaaS applications where modern business happens, such as Microsoft 365, Salesforce, and hundreds more. Backed by top investors including Greylock, Norwest Venture Partners, and IVP, we've built a complete SaaS security platform to reduce risk, detect and respond to threats, and prevent breaches at the source. Our team includes leaders who helped define the categories of endpoint and identity security at CrowdStrike, Okta, Cylance, and Carbon Black. Now, we're transforming how SaaS is secured in the era of agentic AI. Today, Obsidian is trusted by global enterprises like Snowflake, T-Mobile, and Pure Storage. We protect more than 200 organizations across North America, Europe, the Middle East, Southeast Asia, Australia, and New Zealand, including many of the world's largest Fortune 1000 and Global 2000 companies. With strong global momentum, a growing partner ecosystem including SentinelOne, Databricks, and Google Cloud, and a major fundraise on the horizon, we're scaling quickly toward long-term growth and IPO readiness. Join us as we define the future of SaaS security!
