Site Reliability Engineer

Randstad UK

Nottingham, United Kingdom

2 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Intermediate

Job location

Remote

Nottingham, United Kingdom

Tech stack

API

ARM

Bash

Data Infrastructure

Linux

DevOps

Distributed Systems

Elasticsearch

Python

Reliability Engineering

Ansible

Prometheus

Ruby

Scala

Bare Metal

Kafka

Terraform

Data Pipelines

ELK

Job description

We are looking for a Lead SRE to design, scale, and operate massive-scale observability systems that keep our global services online and performant. You will join an autonomous team of software engineers focused on solving complex data infrastructure challenges.

Scale Prometheus metrics infrastructure to handle 100+ million active series.
Operate large Elasticsearch clusters holding 2,000+ TB of data.
Grow high-throughput Kafka data pipelines processing hundreds of thousands of events per second.
Build custom alerting workflows and self-service APIs for internal engineering teams.
Provision cloud and private infrastructure using Terraform.

Requirements

5+ years operating mid-to-large distributed systems on Linux VMs or bare-metal machines.
2+ years developing in Go, Python, Ruby, Scala, or Bash.
Hands-on experience with Prometheus/Thanos/Cortex, Kafka, the ELK stack, Ansible, or Consul.
Comfortable diving into unfamiliar codebases and participating in an on-call rotation.

Keywords: Observability, Monitoring, SRE, Site Reliability Engineering, DevOps, Elasticsearch, ELK, Prometheus, Kafka, Terraform, Linux, Bare Metal
