Site Reliability Engineer in Chicago
Job description
We're looking for a Site Reliability Engineer to support the availability, performance, and reliability of a next-generation cloud platform. You'll collaborate across engineering and infrastructure teams, build automation to reduce toil, improve incident response, and strengthen system resilience through monitoring, metrics, and modern SRE practices.
What You'll Do
- Partner with development, operations, and infrastructure teams to ensure service availability
- Build automation to improve incident response and prevent recurring issues
- Create and enhance runbooks for outages and service degradations
- Assess production readiness and reliability of new and existing services
- Define and track operational metrics for performance, scalability, and availability
- Architect and maintain shared tools that improve reliability across teams
- Contribute to continuous improvement through research, retrospectives, and code reviews
- Influence timelines, expectations, and technical direction within the team
- Mentor junior engineers and help shape sprint planning
Requirements
- Expert in building Kubernetes clusters from scratch
- Experience supporting and troubleshooting large-scale distributed systems
- Strong documentation, communication, and analytical problem-solving skills
- Comfortable working in fast-paced, rapidly changing environments
Technical Skills:
- Hands-on experience managing cloud infrastructure (AWS)
- Analysis using tools like Splunk, AppDynamics, Datadog, Prometheus, Grafana
- Programming/scripting in Java, Python, Bash, or Go
- Experience with distributed messaging (Kafka, RabbitMQ, ActiveMQ)
- Container orchestration (Kubernetes, Docker, Rancher)
- CI/CD tools such as Jenkins, Travis, and Harness
Benefits & conditions
- 15% bonus
- 20+ days PTO
- Health, vision, and dental
- 401(k) with 6% match
- Technology stipend
- Tuition/training reimbursement program