Search - Search Inference - Senior Site Reliability Engineer

Elastic

Charing Cross, United Kingdom

4 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Compensation

£ 92K

Job location

Charing Cross, United Kingdom

Tech stack

Bash

Linux

Python

Software Maintenance

Reliability Engineering

Pulumi

Kubernetes

Performance Monitor

Terraform

Job description

Working with the wider team to evolve our inference service so that it scales efficiently and reliably as it hosts a growing number of models for semantic search, agentic workflows, and foundation models.
Ensuring proactive monitoring and SLO-based alerting using error budgets to prevent incidents before they happen.
Enhancing the scalability and reliability of the service and partnering with the team to ensure knowledge is shared, clear documentation is produced, and best practices are followed.
Growing our global infrastructure to meet increasing scaling demands by developing and maintaining software, tooling, and automations.
Collaborating in an inclusive environment, focusing on operational excellence and uplifting each other with constructive feedback.
Being part of an SRE on-call rotation, responding to operational needs and incidents.

Requirements

5+ years of experience in a site reliability engineer (or equivalent) role, operating services in production at scale
3+ years of experience with Kubernetes, Helm & containerised services
Experience with Terraform, Pulumi, Crossplane, or similar
Experience writing non-trivial code in a language like Python, Go, or equivalent
Strong Linux fundamentals and experience writing Bash scripts
Strong written communication

Bonus

Experience working with Ray and KubeRay is a big plus!
Experience working with the Elastic Observability Stack
