Senior Site Reliability Engineer - Observability, SLOs & Kubernetes Reliability

Htc Inc.
Celebration, United States of America

Role details

Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Experience level: Senior

Job location

Celebration, United States of America

Tech stack

Amazon Web Services (AWS)
Backup Devices
Cloud Computing
Computer Networks
Databases
DevOps
Disaster Recovery
Failover
Routing
Reliability Engineering
Akamai
Runbook
Data Logging
Autoscaling
Istio
System Availability
Grafana
MTTR
Caching
Reliability of Systems
Kubernetes
BIG-IP Access Policy Manager (APM)
Terraform
Splunk
AppDynamics

Job description

We are seeking a senior, hands-on Site Reliability Engineer to help improve the reliability, resilience, observability, and operational maturity of Tier-1 business-critical applications.

This is not a traditional DevOps, monitoring, or ticket-based operations role. This person will work across application, cloud, Kubernetes, infrastructure, and operations teams to identify reliability risk, reduce recurring incidents, improve SLO/SLI practices, strengthen observability, and automate repetitive operational work.

The ideal candidate is a true SRE/reliability engineer who has supported high-availability production systems, participated in major incident/RCA processes, improved monitoring and alerting, reduced MTTR, and partnered with engineering teams to prevent repeat failures.

What You Will Own

  • Improve reliability and availability for critical customer-facing and business-critical applications.
  • Define and mature SLOs, SLIs, error budgets, reliability scorecards, and reliability dashboards.
  • Analyze production incidents, Sev-1/Sev-2 trends, RCAs, and recurring failure patterns to drive permanent corrective actions.
  • Improve observability across metrics, events, logs, and traces using tools such as OpenTelemetry, Grafana Cloud, AppDynamics, Splunk, or similar platforms.
  • Review application and platform reliability across AWS, Kubernetes, databases, caching layers, network dependencies, and CDN/traffic routing.
  • Identify single points of failure, resiliency gaps, alerting gaps, scaling risks, and disaster recovery concerns.
  • Partner with application, infrastructure, performance, and operations teams to improve system resilience.
  • Automate repetitive operational tasks and create self-healing workflows where possible.
  • Support performance, scalability, failover, and production readiness reviews.
  • Contribute to chaos engineering and resilience testing efforts using tools such as Gremlin, Harness Chaos Engineering, or similar.
  • Build and improve runbooks, dashboards, governance processes, and reliability improvement roadmaps.

Requirements

  • 8+ years of experience in Site Reliability Engineering, Production Engineering, Platform Engineering, DevOps Engineering, or a closely related reliability-focused role.
  • Experience with Terraform.
  • Strong hands-on experience supporting and improving high-availability production systems.
  • Experience with incident management, problem management, RCA reviews, postmortems, corrective actions, MTTD, and MTTR improvement.
  • Strong understanding of SRE principles including SLOs, SLIs, error budgets, toil reduction, automation, and production readiness.
  • Experience with Kubernetes reliability, workload health, autoscaling, resource requests/limits, probes, and resiliency patterns.
  • Experience with AWS or similar cloud environments, including reliability, high availability, failover, backup, and disaster recovery concepts.
  • Experience with observability, APM, logging, tracing, alerting, and dashboarding.
  • Ability to analyze system performance, latency, throughput, dependencies, bottlenecks, and failure patterns.
  • Hands-on scripting or automation experience to reduce manual operational work.
  • Strong communication skills with the ability to work across application, infrastructure, operations, and product teams.

Preferred / Nice to Have

  • Nobl9 or similar SLO management tooling.
  • OpenTelemetry and Grafana Cloud.
  • AppDynamics, Splunk, Splunk Observability, or similar APM/observability tools.
  • Gremlin or Harness Chaos Engineering.
  • Akamai CDN or similar CDN/traffic routing experience.
  • AWS Well-Architected Framework experience.
  • CSI or similar incident/problem management platform experience.
  • Service mesh, network policy, traffic routing, caching, database HA, or multi-region reliability experience.

Who Will Do Well in This Role

You will be successful in this role if you are the type of engineer who looks beyond alert response and asks: Why did this fail? How do we prevent it from happening again? How do we make the system more observable, more resilient, and less dependent on manual intervention?

This is a strong fit for a senior SRE who wants to influence reliability standards, improve Tier-1 systems, partner closely with engineering teams, and help mature a true SRE practice.

  • SRE, DevOps, Infrastructure, Platform, or Cloud Operations: 5 years (Required)
  • Expert-level Kubernetes, managing deployments at scale: 3 years (Required)

Benefits & conditions

  • 401(k)
  • 401(k) matching
  • Dental insurance
  • Employee assistance program
  • Employee discount
  • Health insurance
  • Health savings account
  • Life insurance
  • Paid time off
  • Relocation assistance
  • Vision insurance
