Senior Site Reliability Engineer - Observability, SLOs & Kubernetes Reliability

Htc Inc.

Celebration, United States of America

2 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Celebration, United States of America

Tech stack

Amazon Web Services (AWS)

Backup Devices

Cloud Computing

Computer Networks

Databases

DevOps

Disaster Recovery

Failover

Routing

Reliability Engineering

Akamai

Runbook

Data Logging

Autoscaling

Istio

System Availability

Grafana

MTTR

Caching

Reliability of Systems

Kubernetes

BIG-IP Access Policy Manager (APM)

Terraform

Splunk

AppDynamics

Job description

We are seeking a senior, hands-on Site Reliability Engineer to improve the reliability, resilience, observability, and operational maturity of Tier-1 business-critical applications.

This is not a traditional DevOps, monitoring, or ticket-based operations role. This person will work across application, cloud, Kubernetes, infrastructure, and operations teams to identify reliability risk, reduce recurring incidents, improve SLO/SLI practices, strengthen observability, and automate repetitive operational work.

The ideal candidate is a true SRE/reliability engineer who has supported high-availability production systems, participated in major incident/RCA processes, improved monitoring and alerting, reduced MTTR, and partnered with engineering teams to prevent repeat failures.

What You Will Own

Improve reliability and availability for critical customer-facing and business-critical applications.
Define and mature SLOs, SLIs, error budgets, reliability scorecards, and reliability dashboards.
Analyze production incidents, Sev-1/Sev-2 trends, RCAs, and recurring failure patterns to drive permanent corrective actions.
Improve observability across metrics, events, logs, and traces using tools such as OpenTelemetry, Grafana Cloud, AppDynamics, Splunk, or similar platforms.
Review application and platform reliability across AWS, Kubernetes, databases, caching layers, network dependencies, and CDN/traffic routing.
Identify single points of failure, resiliency gaps, alerting gaps, scaling risks, and disaster recovery concerns.
Partner with application, infrastructure, performance, and operations teams to improve system resilience.
Automate repetitive operational tasks and create self-healing workflows where possible.
Support performance, scalability, failover, and production readiness reviews.
Contribute to chaos engineering and resilience testing efforts using tools such as Gremlin, Harness Chaos Engineering, or similar.
Build and improve runbooks, dashboards, governance processes, and reliability improvement roadmaps.

Requirements

Do you have experience in Terraform?

8+ years of experience in Site Reliability Engineering, Production Engineering, Platform Engineering, DevOps Engineering, or a closely related reliability-focused role.
Strong hands-on experience supporting and improving high-availability production systems.
Experience with incident management, problem management, RCA reviews, postmortems, corrective actions, MTTD, and MTTR improvement.
Strong understanding of SRE principles including SLOs, SLIs, error budgets, toil reduction, automation, and production readiness.
Experience with Kubernetes reliability, workload health, autoscaling, resource requests/limits, probes, and resiliency patterns.
Experience with AWS or similar cloud environments, including reliability, high availability, failover, backup, and disaster recovery concepts.
Experience with observability, APM, logging, tracing, alerting, and dashboarding.
Ability to analyze system performance, latency, throughput, dependencies, bottlenecks, and failure patterns.
Hands-on scripting or automation experience to reduce manual operational work.
Strong communication skills with the ability to work across application, infrastructure, operations, and product teams.

Preferred / Nice to Have

Nobl9 or similar SLO management tooling.
OpenTelemetry and Grafana Cloud.
AppDynamics, Splunk, Splunk Observability, or similar APM/observability tools.
Gremlin or Harness Chaos Engineering.
Akamai CDN or similar CDN/traffic routing experience.
AWS Well-Architected Framework experience.
CSI or similar incident/problem management platform experience.
Service mesh, network policy, traffic routing, caching, database HA, or multi-region reliability experience.

Who Will Do Well in This Role

You will be successful in this role if you are the type of engineer who looks beyond alert response and asks: Why did this fail? How do we prevent it from happening again? How do we make the system more observable, more resilient, and less dependent on manual intervention?

This is a strong fit for a senior SRE who wants to influence reliability standards, improve Tier-1 systems, partner closely with engineering teams, and help mature a true SRE practice.

SRE, DevOps, Infrastructure, Platform, or Cloud Operations: 5 years (Required)
Expert-level Kubernetes experience managing deployments at scale: 3 years (Required)

Benefits & conditions

401(k)
401(k) matching
Dental insurance
Employee assistance program
Employee discount
Health insurance
Health savings account
Life insurance
Paid time off
Relocation assistance
Vision insurance
