Search - Search Inference - Senior Site Reliability Engineer
Job description
- Working with the wider team to evolve our inference service so it can scale efficiently and reliably while hosting a growing number of models for semantic search, agentic workflows, and foundation models.
- Ensuring proactive monitoring and SLO-based alerting, using error budgets to catch problems before they become incidents.
- Enhancing the scalability and reliability of the service, and partnering with the team to ensure knowledge is shared, clear documentation is produced, and best practices are followed.
- Growing our global infrastructure to meet increasing scaling demands by developing and maintaining software, tooling, and automation.
- Collaborating in an inclusive environment, focusing on operational excellence and uplifting each other with constructive feedback.
- Being part of an SRE on-call rotation, responding to operational needs and incidents.
Requirements
- 5+ years of experience in a site reliability engineer (or equivalent) role, operating services in production at scale
- 3+ years of experience with Kubernetes, Helm, and containerised services
- Experience with Terraform, Pulumi, Crossplane, or similar
- Experience writing non-trivial code in a language like Python, Go, or equivalent
- Strong Linux fundamentals and experience writing Bash scripts
- Strong written communication
Bonus
- Experience working with Ray and KubeRay is a big plus!
- Experience working with the Elastic Observability Stack