Site Reliability Engineer (SRE) in Nationwide

Energy Jobline
Olathe, United States of America
2 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Intermediate

Job location

Remote
Olathe, United States of America

Tech stack

API
ARM
Bash
Data Infrastructure
Linux
DevOps
Distributed Systems
Elasticsearch
Python
Reliability Engineering
Ansible
Prometheus
Ruby
Scala
Bare Metal
Kafka
Terraform
Data Pipelines
ELK
Go

Job description

We are looking for a Lead SRE to design, scale, and operate massive-scale observability systems that keep our global services online and performant. You will join an autonomous team of software engineers focused on solving complex data infrastructure challenges.

Scale Prometheus metrics infrastructure to handle 100+ million active series.

Operate large Elasticsearch clusters holding 2,000+ TB of data.

Grow high-throughput Kafka data pipelines processing hundreds of thousands of events per second.

Build custom alerting workflows and self-service APIs for internal engineering teams.

Requirements

5+ years operating mid-to-large distributed systems on Linux VMs or bare-metal machines.

2+ years developing in Go, Python, Ruby, Scala, or Bash.

Hands-on experience with Prometheus/Thanos/Cortex, Kafka, the ELK stack, Ansible, or Consul.

Comfortable diving into unfamiliar codebases and participating in an on-call rotation.

Keywords: Observability, Monitoring, SRE, Site Reliability Engineering, DevOps, Elasticsearch, ELK, Prometheus, Kafka, Terraform, Linux, Bare Metal
