Search - Search Inference - Senior Site Reliability Engineer

Elastic
19 days ago

Role details

Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Experience level: Senior
Compensation: £88K

Tech stack

Amazon Web Services (AWS)
Azure
Bash
Linux
Distributed Systems
Elasticsearch
Information Retrieval
Python
Machine Learning
Natural Language Processing
Software Maintenance
Reliability Engineering
Pulumi
Kubernetes
Performance Monitor
Terraform

Job description

The Search Inference team is responsible for bringing performant, ergonomic, and cost-effective machine learning (ML) model inference to Search workflows. ML inference has become a crucial part of the modern search experience, whether it is used for query understanding, semantic search, RAG, or any other GenAI use case.

Our goal is to simplify ML inference in Search workflows by focusing on large-scale inference capabilities for embedding and reranking models that are available across the Elasticsearch user base. We are a collaborative, cross-functional team with backgrounds in information retrieval, natural language processing, and distributed systems. We work with Go services, Python, Ray Serve, and Kubernetes/KubeRay, and we work across AWS, GCP, and Azure.

We provide thought leadership through open code repositories, blog posts, and conference talks. We focus on meeting our customers' expectations for throughput, latency, and cost. We're seeking an experienced Senior Site Reliability Engineer to help us deliver on this vision!

Responsibilities

  • Working with the wider team to evolve our inference service so it can scale efficiently and reliably while hosting a growing number of models for semantic search, agentic workflows, and foundation models.
  • Ensuring proactive monitoring and SLO-based alerting, using error budgets to catch problems before they become incidents.
  • Enhancing the scalability and reliability of the service and partnering with the team to ensure knowledge is shared, clear documentation is produced, and best practices are followed.
  • Growing our global infrastructure to meet increasing demand by developing and maintaining software, tooling, and automation.
  • Collaborating in an inclusive environment, focusing on operational excellence and uplifting each other with constructive feedback.
  • Being part of an SRE on-call rotation responding to operational needs and incidents.

Requirements

  • 5+ years of experience in a site reliability engineer (or equivalent) role, operating services in production at scale
  • 3+ years of experience with Kubernetes, Helm & containerised services
  • Experience with Terraform, Pulumi, Crossplane, or similar
  • Experience writing non-trivial code in a language like Python, Go, or equivalent
  • Strong Linux fundamentals and experience writing Bash scripts
  • Strong written communication

Experience working with Ray and KubeRay is a big plus, as is experience with the Elastic Observability Stack.
