Staff Site Reliability Engineer
Job description
As a Staff SRE at Obsidian, you will define and drive the company-wide reliability vision for a complex, multi-tenant SaaS platform serving enterprise and financial customers. You will operate as a strategic partner to DevOps and Platform Engineering leadership, shaping a unified reliability strategy that scales across the organization.
Your core mandate: ensure Obsidian detects, diagnoses, and communicates system issues before customers are impacted, consistently and predictably.
This is a hands-on technical role that involves architecting and leading the implementation of systems that handle real-world complexity, including upstream SaaS dependencies, sparse and noisy signals, and mission-critical enterprise workloads.
- Reliability Strategy & Architecture - Define and lead long-term reliability strategy across services. Establish end-to-end system visibility frameworks and guide architecture for observability, detection, and resilience.
- Cross-Org Leadership - Partner across teams to embed reliability, standardize SLI/SLOs, and serve as a technical escalation expert.
- Detection & Observability - Build intelligent detection systems (anomaly detection, connector health models) and enable self-service observability.
- Incident Management - Define and evolve a tiered incident communication strategy, improve response practices, and lead postmortems to strengthen reliability and customer trust.
- Execution - Contribute hands-on to system design, monitoring, and debugging across distributed systems and data pipelines.
Requirements
Do you have experience in SaaS?
- 5+ years in SRE, Production Engineering, or related roles
- 3+ years operating at a senior or technical leadership level (Staff or equivalent scope)
- Deep expertise in:
  - AWS and/or GCP
  - Kubernetes and Helm
  - Observability stacks (Prometheus, Grafana, or equivalent)
  - CI/CD systems (GitLab CI/CD, ArgoCD, etc.)
- Proven experience designing and scaling reliability systems for multi-tenant SaaS platforms
- Strong debugging and systems thinking across distributed microservices and legacy systems
- Demonstrated ability to lead initiatives that improve incident detection, response, and system resilience
- Hands-on engineering approach with a track record of building, not just configuring, reliability systems

- Experience in B2B SaaS serving enterprise or financial customers
- Familiarity with third-party SaaS connector architectures and ingestion patterns
- Experience building anomaly detection or intelligent alerting systems
- Experience designing customer-facing status pages and incident communication frameworks
Benefits & conditions
Why This Role
- Drive org-wide reliability strategy
- Own and build new detection & observability systems
- Tackle complex distributed systems challenges
- Safeguard critical infrastructure for financial customers
What Success Looks Like
- Issues are caught and resolved before customer impact
- Reliability is measurable and continuously improving
- Teams self-serve observability with scalable tools
- Clear, proactive incident communication builds trust
- Reliability becomes a competitive advantage