Staff Site Reliability Engineer (SRE)
Job description
To deliver on this vision, our engineers tackle significant distributed systems challenges on a daily basis, where designing for reliability and performance at scale is essential to ensuring a seamless user experience. As we unlock the next level of scale, these challenges become even more critical. We're looking to bring on a Staff SRE with a strong software engineering grounding to help drive this next phase of growth.
We're also looking for people who thrive in an empowered environment - individuals who are comfortable being given problems to solve rather than solutions to implement. You should enjoy working at pace, value autonomy, and prefer to ask for forgiveness rather than permission. As a senior member of the team, you'll influence infrastructure direction, work closely with stakeholders across engineering, and play a key role in mentoring and developing engineers in reliability-focused practices.
While Sparta is a remote-first company, for this role we're looking for someone who values a hybrid working style. In a typical week, that could mean spending a couple of days in the office, with flexibility built in.
What You'll Be Doing:
- Help lead the design and evolution of reliable, scalable, and observable infrastructure underpinning our real-time and analytical data systems.
- Work as part of the engineering team, rather than a separate infrastructure function, fostering a culture of "you build it, you run it" and approaching SRE from a strong software engineering foundation.
- Drive SRE best practices, including SLIs/SLOs, error budgets, and operational readiness across engineering teams.
- Optimise systems for performance, availability, and cost efficiency across cloud environments.
- Mentor engineers in distributed systems, reliability engineering, operational excellence, and infrastructure design.
Requirements
- 7+ years in software engineering, DevOps, SRE, or infrastructure roles, with a focus on software engineering.
- 2+ years working within a product-focused organisation, collaborating closely with cross-functional teams.
- Deep understanding of distributed systems, system performance, and reliability engineering.
- Proven experience operating cloud-hosted production systems at scale (AWS or GCP preferred).
- Strong capability to reason about system trade-offs, including latency, throughput, failure modes, redundancy, and cost.
- Experienced in building, analysing, and maintaining observability systems (metrics, logging, tracing, alerting).
- Comfortable with multiple languages such as Python, Java, Kotlin, or TypeScript.
- Equally comfortable working at a high architectural level and diving into low-level system details.
- Experienced with infrastructure technologies such as Kafka, Flink, Redis, and clustered Postgres.
- Strong capability in infrastructure automation, CI/CD, and cloud engineering practices.
- Experience with networking concepts and technologies, including routing, load balancing, and service-to-service communication.
Nice-to-have experience:
- Experience leading teams or managing engineers, driving both delivery and technical excellence.
- Exposure to complex distributed environments with demanding constraints such as high throughput, low latency, or large-scale datasets.
- Hands-on experience managing self-hosted infrastructure such as Kubernetes, Istio, and related ecosystem components.