Site Reliability Engineer
Job description
- Establish and improve SLOs, SLIs, and SLAs across services; partner with engineering teams to embed reliability targets into product designs.
- Build and evolve monitoring, alerting, and tracing systems to ensure rapid detection and resolution of issues.
- Develop incident response processes, on-call rotations, and postmortem practices that drive continuous improvement.
- Implement automation for deployment pipelines, failover, scaling, and capacity planning to reduce manual operations and error risk.
- Champion security and compliance-driven infrastructure, including secrets management, secure networking, and audit readiness.
- Collaborate on disaster recovery strategies and resilience testing (chaos engineering, load testing, rolling updates, blue/green deployments).
- Partner with developers to identify performance bottlenecks, optimize services, and reduce infrastructure costs.
- Contribute to internal tooling and developer experience to accelerate safe delivery of features in production.
Requirements
- 5+ years in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles supporting distributed systems at scale.
- Deep expertise with Kubernetes, container orchestration, and service meshes in production environments.
- Strong skills in observability tooling (Prometheus, Grafana, OpenTelemetry, etc.) and incident management systems.
- Experience designing HA/DR architectures, managing multi-region deployments, and optimizing for low-latency traffic flows.
- Proficiency with cloud platforms (AWS/GCP/Azure) and infrastructure-as-code (Terraform, Helm).
- Security and compliance mindset; comfortable with regulated environments (HIPAA/GDPR) and auditing requirements.
- Excellent cross-functional communication and collaboration skills.
Preferred qualifications
- Experience with streaming/messaging systems (Kafka, RabbitMQ) in production.
- Background in digital health, IoT, or other mission-critical data platforms.
- Familiarity with chaos engineering tools and cost-optimization strategies for global cloud services.
- Development experience in a modern backend language (Java, Kotlin, Go, Python) for tooling and automation.