Senior Site Reliability Engineer - Observability, SLOs & Kubernetes Reliability
Job description
We are seeking a senior, hands-on Site Reliability Engineer to help improve the reliability, resilience, observability, and operational maturity of Tier-1 business-critical applications.
This is not a traditional DevOps, monitoring, or ticket-based operations role. This person will work across application, cloud, Kubernetes, infrastructure, and operations teams to identify reliability risk, reduce recurring incidents, improve SLO/SLI practices, strengthen observability, and automate repetitive operational work.
The ideal candidate is a true SRE/reliability engineer who has supported high-availability production systems, participated in major incident/RCA processes, improved monitoring and alerting, reduced MTTR, and partnered with engineering teams to prevent repeat failures.
What You Will Own
- Improve reliability and availability for critical customer-facing and business-critical applications.
- Define and mature SLOs, SLIs, error budgets, reliability scorecards, and reliability dashboards.
- Analyze production incidents, Sev-1/Sev-2 trends, RCAs, and recurring failure patterns to drive permanent corrective actions.
- Improve observability across metrics, events, logs, and traces using tools such as OpenTelemetry, Grafana Cloud, AppDynamics, Splunk, or similar platforms.
- Review application and platform reliability across AWS, Kubernetes, databases, caching layers, network dependencies, and CDN/traffic routing.
- Identify single points of failure, resiliency gaps, alerting gaps, scaling risks, and disaster recovery concerns.
- Partner with application, infrastructure, performance, and operations teams to improve system resilience.
- Automate repetitive operational tasks and create self-healing workflows where possible.
- Support performance, scalability, failover, and production readiness reviews.
- Contribute to chaos engineering and resilience testing efforts using tools such as Gremlin, Harness Chaos Engineering, or similar.
- Build and improve runbooks, dashboards, governance processes, and reliability improvement roadmaps.
Requirements
Do you have experience in Terraform?
- 8+ years of experience in Site Reliability Engineering, Production Engineering, Platform Engineering, DevOps Engineering, or a closely related reliability-focused role.
- Strong hands-on experience supporting and improving high-availability production systems.
- Experience with incident management, problem management, RCA reviews, postmortems, corrective actions, MTTD, and MTTR improvement.
- Strong understanding of SRE principles including SLOs, SLIs, error budgets, toil reduction, automation, and production readiness.
- Experience with Kubernetes reliability, workload health, autoscaling, resource requests/limits, probes, and resiliency patterns.
- Experience with AWS or similar cloud environments, including reliability, high availability, failover, backup, and disaster recovery concepts.
- Experience with observability, APM, logging, tracing, alerting, and dashboarding.
- Ability to analyze system performance, latency, throughput, dependencies, bottlenecks, and failure patterns.
- Hands-on scripting or automation experience to reduce manual operational work.
- Strong communication skills with the ability to work across application, infrastructure, operations, and product teams.
Preferred / Nice to Have
- Nobl9 or similar SLO management tooling.
- OpenTelemetry and Grafana Cloud.
- AppDynamics, Splunk, Splunk Observability, or similar APM/observability tools.
- Gremlin or Harness Chaos Engineering.
- Akamai CDN or similar CDN/traffic routing experience.
- AWS Well-Architected Framework experience.
- CSI or similar incident/problem management platform experience.
- Service mesh, network policy, traffic routing, caching, database HA, or multi-region reliability experience.
Who Will Do Well in This Role
You will be successful in this role if you are the type of engineer who looks beyond alert response and asks: Why did this fail? How do we prevent it from happening again? How do we make the system more observable, more resilient, and less dependent on manual intervention?
This is a strong fit for a senior SRE who wants to influence reliability standards, improve Tier-1 systems, partner closely with engineering teams, and help mature a true SRE practice.
- SRE, DevOps, Infrastructure, Platform, or Cloud Operations: 5 years (Required)
- Expert-level Kubernetes, managing deployments at scale: 3 years (Required)
Benefits & conditions
- 401(k)
- 401(k) matching
- Dental insurance
- Employee assistance program
- Employee discount
- Health insurance
- Health savings account
- Life insurance
- Paid time off
- Relocation assistance
- Vision insurance