Site Reliability Engineering Manager
Job description
This is an engineering leadership role, not simply an on-call manager. The SRE Manager owns two things: keeping RapidSOS's cloud infrastructure running reliably, and helping product teams get to a place where they can run their own services without routing every operational issue through SRE. RapidSOS powers real-time emergency response by connecting life-critical data to first responders, so reliability here directly impacts outcomes in moments that matter.
You'll lead the SRE Operations team and report to the Director of SRE & Platform Engineering. The team has real roots in NOC-style operations, and the honest goal of this role is to move it toward something more engineering-focused and proactive: better tooling, better practices, more ownership at the service team level. That's a gradual transition, and you'll be the one shaping how it happens.
What you'll do:
- Own the reliability, scalability, and operational health of RapidSOS Kubernetes clusters, shared services, and core AWS infrastructure; manage upgrades, capacity planning, and node scaling, and test that multi-region failover actually works
- Drive the IaC foundation in Terraform/Atlantis and champion infrastructure-as-code as a core engineering standard
- Partner with Engineering Managers to set SLOs for their services, establish error budgets, and help teams build the habits to operate what they ship; the goal is for product teams to own their services, not to have SRE own everything on their behalf
- Maintain proactive reliability work: capacity planning, failure mode analysis, runbook quality, and chaos engineering exercises; run reliability reviews before major launches and organize failure mode exercises with product teams
- Drive blameless postmortem practice, ensuring every significant incident produces systemic improvements with clear ownership and closure
- Run the Tier 1 on-call rotation: scheduling for primary and secondary engineers, coordination with the 3rd-party NOC, and keeping incident escalation processes smooth and manageable
- Lead incident command on Sev-1s, escalate when needed, and keep engineering leadership informed throughout
- Lead and grow a high-impact team by mentoring engineers, owning headcount, and thinking ahead about what the team needs as the function grows
- Shape the team's long-term AI strategy for infrastructure and operations by identifying opportunities for AI-driven automation and insight generation, evaluating tooling and workflows, and operationalizing best practices for scalable team-wide usage
- Own reserved instance strategy, the team's AWS cost footprint, and error budgets and SLOs across production services, and communicate that picture clearly to engineering and product leadership
- Work alongside Platform SRE on bigger infrastructure projects: Gateway API adoption, cross-region architecture, security changes
Requirements
- 7+ years in SRE, platform engineering, or DevOps, with at least two years where you were responsible for a team and not just your own work
- You've been directly responsible for Kubernetes and AWS infrastructure in production environments where uptime and resilience are critical
- Experience moving a team from reactive ops toward engineering-first reliability practices
- You've worked collaboratively with engineering teams to proactively improve reliability, scalability, and operational readiness before issues reach production
- Ability to write and review production-quality Python scripts and tooling
- You've applied SLOs, error budgets, and blameless postmortems in practice to improve reliability and drive better engineering decisions
- Hands-on familiarity with: Terraform/Atlantis, Kubernetes/Helm/ArgoCD, Datadog, Concourse CI/GitHub Actions, RabbitMQ, and AWS (EKS, RDS/Aurora, ElastiCache, VPC networking, IAM, KMS, Route53)
Benefits & conditions
What we offer:
- The chance to work with a passionate team on solving one of the largest challenges globally
- Competitive salary, benefits, and equity participation
- A dynamic, flexible and fun start-up work environment with a highly talented team
If you're curious to learn more about RapidSOS, you can check out https://rapidsos.com/blog/
Starting pay for a successful applicant will depend on a variety of job-related factors, which may include experience, relevant skills, training, education, location, business needs, or market demands. The salary range for this role is $185,000 - $215,000. This role will also be eligible to receive equity options. #LI-Remote