Site Reliability Engineer / SRE / Systems Engineer

AWD

Altrincham, United Kingdom

2 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Compensation

£70K

Job location

Remote

Altrincham, United Kingdom

Tech stack

Systems Engineering

Bash

Cloud Computing

Cloud Engineering

Computer Security

Dynamic Host Configuration Protocol

DevOps

DNS

GitHub

Monitoring of Systems

IPv4

IPv6

Python

Linux System Administration

Reliability Engineering

Ansible

Prometheus

Data Logging

Google Cloud Platform

Cloud Platform System

System Availability

Grafana

IT Architecture

Backend

Containerization

Kubernetes

Terraform

Splunk

PPPoE

Software Version Control

Docker

Job description

A fantastic opportunity for a Site Reliability Engineer / Systems Engineer to support highly available, scalable production systems within a fast-growing technology environment, working across cloud platforms, DevOps, networking and operational resilience.

If you've worked in any of the following roles, we'd also like to hear from you: DevOps Engineer, Operations Engineer, Cloud Engineer, Platform Engineer, Systems Engineer, Infrastructure Engineer, Production Engineer.

As a Site Reliability Engineer / Systems Engineer, you will act as the vital link between operations, end users and backend development teams, ensuring system availability, performance optimisation and effective incident management across live environments.

This Site Reliability Engineer / Systems Engineer role offers the chance to work with modern cloud technologies, containerisation, observability tools and automation practices, while influencing long-term reliability improvements across business-critical systems.

Your duties as the Site Reliability Engineer / Systems Engineer include:

Incident Triage and Ownership: Acting as first-line technical escalation for live production issues through to resolution or handover
System Monitoring and Availability: Maintaining high availability, performance and scalability of production platforms and services
Observability Implementation: Managing logging, monitoring, alerting and metrics to proactively identify and resolve issues
Reliability Improvements: Collaborating with development teams to translate operational insights into long-term platform resilience
Automation and Resilience: Supporting automation, incident response and continuous improvement practices
New Service Support: Ensuring new products and features are operable, reliable and scalable from day one
Cross-Team Collaboration: Working with network engineering, operations and support teams to diagnose service issues
Documentation and Reporting: Creating and maintaining runbooks, escalation guides and incident reports
Incident Prioritisation: Balancing customer impact with long-term system health and stability
Security and Compliance: Supporting compliance with security, availability and regulatory frameworks

Requirements

Previous experience in a Site Reliability Engineer, DevOps Engineer, Systems Engineer or Operations Engineer role
Experience supporting production services at scale within a DevOps or SRE environment
Strong working knowledge of ISP-related networking concepts including DNS, DHCP, PPPoE, RADIUS and IPv4/IPv6
Experience with observability tools such as Prometheus, Grafana, ELK or Splunk
Hands-on experience with containerisation and orchestration using Docker and Kubernetes
Cloud platform experience, ideally Google Cloud Platform, including automation and scaling practices
Strong Linux administration skills with scripting capability in Bash, Python or similar
Familiarity with source control and CI/CD pipelines, using tools such as GitHub and GitHub Actions
Understanding of security frameworks and operational resilience best practices