Senior Site Reliability Engineer

Cox Inc

Austin, United States of America

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Compensation

$186K

Job location

Remote

Austin, United States of America

Tech stack

Java

Amazon Web Services (AWS)

Azure

Cloud Computing

Configuration Management

Code Review

Databases

Continuous Integration

DevOps

Distributed Systems

DNS

Fault Tolerance

GitHub

Identity and Access Management

Python

PostgreSQL

Linux System Administration

MySQL

Networking Basics

Open Source Technology

Oracle Applications

Pair Programming

Redis

Reliability Engineering

New Relic

Prometheus

Runbook

Software Engineering

TCP/IP

Load Balancing

Delivery Pipeline

Grafana

CloudFormation

Kubernetes

Infrastructure Automation Frameworks

Cassandra

Cloud Optimization

Terraform

Splunk

PagerDuty

ELK

Jenkins

Microservices

Job description

We are looking for a Senior Site Reliability Engineer who is passionate about building and maintaining highly available, scalable, and resilient systems. In this role you will serve as a senior engineer on the SRE team, driving reliability improvements across our production infrastructure while mentoring engineers and shaping our incident response culture.

You will partner closely with software engineering, security, and product teams to embed reliability into every stage of the development lifecycle. This is a high-impact position for someone who thrives at the intersection of software engineering and operations.

Key Responsibilities

Design, build, and maintain production infrastructure across cloud platforms (AWS, GCP, or Azure), meeting availability targets of 99.99% or higher
Define and champion SLOs, SLIs, and error budgets; drive data-informed reliability decisions across engineering teams
Lead incident response efforts as Incident Commander; conduct blameless post-mortems and drive remediation to completion
Develop and maintain infrastructure-as-code (Terraform or CloudFormation) and CI/CD pipelines for automated, repeatable deployments
Build and improve observability platforms using tools such as Prometheus, Grafana, New Relic, Splunk, or the ELK stack
Reduce toil through automation, custom tooling, self-healing systems, and proactive capacity planning
Architect and operate container orchestration systems (Kubernetes, ECS) at scale with emphasis on cost efficiency and performance
Collaborate with security teams to embed security best practices into infrastructure and deployment pipelines
Mentor junior and mid-level SREs through code reviews, knowledge-sharing sessions, and pair programming
Contribute to the on-call rotation and continuously improve runbooks, alerting, and escalation procedures

Our stack includes Kubernetes, Terraform, and AWS; New Relic, Prometheus, and Grafana for monitoring; PagerDuty for on-call; GitHub Actions for CI/CD; and a mix of Java, Go, and Python microservices. We are a team that values automation over manual intervention and continuously invests in reducing toil.

Requirements

7+ years of experience in SRE, DevOps, or platform engineering roles with progressive responsibility
Strong proficiency in at least one programming language (Python, Go, Java, or similar) for systems-level automation and tooling
Deep hands-on experience with at least one major cloud provider (AWS, GCP, or Azure) including networking, IAM, and managed services
Expert-level knowledge of container orchestration (Kubernetes) and microservices architectures
Demonstrated experience defining SLOs/SLIs and managing error budgets in production environments
Solid understanding of distributed systems concepts: consensus algorithms, CAP theorem, eventual consistency, and fault-tolerant design
Proficiency with infrastructure-as-code tools (Terraform, CloudFormation) and configuration management
Experience with CI/CD platforms (Jenkins, GitHub Actions) and GitOps workflows
Strong Linux systems administration skills and networking fundamentals (TCP/IP, DNS, load balancing, CDN)
Proven track record of leading incident response, writing effective post-mortems, and implementing systemic fixes

Preferred Qualifications

Familiarity with chaos engineering practices and tools
Background in database reliability engineering (Oracle, PostgreSQL, MySQL, Redis, or Cassandra at scale)
Hands-on experience with FinOps practices and cloud cost optimization
Contributions to open-source SRE or infrastructure projects
Relevant certifications (CKA, AWS Solutions Architect Professional, GCP Professional Cloud Architect)

Benefits & conditions

Competitive base salary with annual performance bonuses
Comprehensive health, dental, and vision insurance with generous employer contribution
Flexible hybrid/remote work model
Annual learning and development budget for conferences, certifications, and courses
Generous PTO policy, paid parental leave, and wellness programs
401(k) with employer match
Collaborative, blameless engineering culture that values continuous improvement

Equal Opportunity Statement

We are an equal opportunity employer committed to creating an inclusive environment for all employees. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, veteran status, or any other legally protected characteristic.

USD 111,600.00 - 186,000.00 per year

Compensation includes a base salary in the range of $111,600.00 - $186,000.00. The base salary may vary within the anticipated base pay range based on factors such as the ultimate location of the position and the selected candidate's knowledge, skills, and abilities. Position may be eligible for additional compensation that may include an incentive program.

The Company offers eligible employees the flexibility to take as much vacation with pay as they deem consistent with their duties, the company's needs, and its obligations; seven paid holidays throughout the calendar year; and up to 160 hours of paid wellness annually for their own wellness or that of family members. Employees are also eligible for additional paid time off in the form of bereavement leave, time off to vote, jury duty leave, volunteer time off, military leave, and parental leave.
