Search - Search Inference - Senior Site Reliability Engineer

Elastic
Charing Cross, United Kingdom
4 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior
Compensation
£92K

Tech stack

Bash
Linux
Python
Software Maintenance
Reliability Engineering
Pulumi
Kubernetes
Performance Monitor
Terraform

Job description

  • Working with the wider team to evolve our inference service so that it can scale efficiently and reliably while hosting a growing number of models for semantic search and agentic workflows, as well as foundation models.

  • Ensuring proactive monitoring and SLO-based alerting using error budgets to prevent incidents before they happen.

  • Enhancing the scalability and reliability of the service and partnering with the team to ensure knowledge is shared, clear documentation is produced, and best practices are followed.

  • Growing our global infrastructure to meet increasing scaling demands by developing and maintaining software, tooling, and automations.

  • Collaborating in an inclusive environment, focusing on operational excellence and uplifting each other with constructive feedback.

  • Being part of an SRE on-call rotation, responding to operational needs and incidents.

Requirements

  • 5+ years of experience in a site reliability engineer (or equivalent) role, operating services in production at scale

  • 3+ years of experience with Kubernetes, Helm & containerised services

  • Experience with Terraform/Pulumi/Crossplane or similar

  • Experience writing non-trivial code in a language like Python, Go, or equivalent

  • Strong Linux fundamentals, experience writing Bash scripts

  • Strong written communication

Bonus

  • Experience working with Ray and KubeRay is a big plus!

  • Experience working with the Elastic Observability Stack
