Senior Software Engineer - SRE
Job description
You'll join the Infra Platform team - the backbone of everything engineering at Vibe. We own hosting, provisioning, observability, CI/CD, internal tooling, security, and compliance. Every product and platform team runs on what we build.
You'll report to our Lead Platform Engineer, and together with the team, you'll help shape the technical direction and uphold the infrastructure standards the whole company relies on.
This role exists because our platform is growing fast, and more than 60 engineers depend on it every day. We're looking for someone who can take on real ownership and help shape how it scales. You'll work on infrastructure handling more than 600k QPS, latency targets under 10ms, and over 5 PB of storage. Just as importantly, you'll be someone other engineers turn to when the hard problems show up, which gives this role real influence across the company.
What You'll Do
Keep the platform running at scale
- Own reliability across hosting, provisioning, network, and compute - targeting 99.99% uptime
- Proactively maintain infrastructure, track drift, and lead migrations before they become incidents
- Respond to production issues quickly, communicate clearly, and fix root causes so improvements last
Build and evolve cloud infrastructure
- Design and implement core infra components that can scale by further orders of magnitude
- Make smart build-vs-buy calls and evolve systems as business needs shift
- Manage infrastructure as code across different providers
Improve developer experience across the org
- Make it easier for engineering teams to move fast
- Build shared tooling: internal CLIs, shared libraries, automation scripts in Go or Python
- Partner with dev teams to help them ship faster without creating technical debt
Support ML infrastructure and build real-time streaming systems
- Support the infrastructure behind frequent model retraining on large datasets
- Help improve compute efficiency and model serving performance to meet inference latency targets
- Build, scale, and operate the real-time streaming platform
Champion security with exposure to compliance requirements
- Enforce best practices around permissions, secret handling, and network security
- Support SOC2 compliance work and embed security into new projects from day one
- Be the pragmatic voice in risk trade-offs - protection without paralysis
Requirements
- Hands-on experience operating production infrastructure at meaningful scale, with strong instincts around reliability, resilience, and performance under load
- Strong experience with infrastructure as code, especially Terraform, with ownership of the full lifecycle from implementation to continuous improvement
- Fluency in at least one systems-oriented language, ideally Go or Python, with the ability to build automation and operational tooling
- Deep experience with CI/CD, observability, and production operations, including metrics, logs, traces, alerting, and debugging live systems
- Comfortable leading incident response, improving service reliability, and driving root cause resolution
- Able to support product and engineering teams as a trusted infrastructure partner
- Uses AI effectively in day-to-day engineering, with strong judgment about when it adds leverage and when deeper manual work and critical thinking matter more
Nice to Haves
- Built internal developer tooling - service scaffolding, unified observability portals, self-serve provisioning
- Hands-on experience with SOC2 compliance - you've helped a team get there, not just read about it