Site Reliability Engineer
AWD
Manchester, United Kingdom
2 days ago
Role details
Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Compensation: £70K
Job location: Manchester, United Kingdom
Tech stack
Systems Engineering
Bash
Cloud Computing
Computer Security
Continuous Integration
Dynamic Host Configuration Protocol
Linux
DevOps
DNS
GitHub
IPv4
IPv6
Python
Linux System Administration
Reliability Engineering
Ansible
Prometheus
Data Logging
Google Cloud Platform
Cloud Platform System
System Availability
Grafana
IT Architecture
Backend
Containerization
Kubernetes
Terraform
Splunk
PPPoE
Software Version Control
Docker
Job description
- Acting as first-line technical escalation for live production issues through to resolution or handover
- Maintaining high availability, performance and scalability of production platforms and services
- Managing logging, monitoring, alerting and metrics to proactively identify and resolve issues
- Collaborating with development teams to translate operational insights into long-term platform resilience
- Supporting automation, incident response and continuous improvement practices
- Ensuring new products and features are operable, reliable and scalable from day one
- Working with network engineering, operations and support teams to diagnose service issues
- Creating and maintaining runbooks, escalation guides and incident reports
- Balancing customer impact with long-term system health and stability
- Supporting compliance with security, availability and regulatory frameworks
Technologies:
- Bash
- CI/CD
- Cloud
- DevOps
- Docker
- ELK
- GitHub
- Grafana
- Support
- Kubernetes
- Linux
- Network
- OSS
- Prometheus
- Python
- Security
- Splunk
- Terraform
- Ansible
- Backend
Requirements
- Previous experience in a Site Reliability Engineer, DevOps Engineer, Systems Engineer or Operations Engineer role
- Experience supporting production services at scale within a DevOps or SRE environment
- Strong working knowledge of ISP-related networking concepts including DNS, DHCP, PPPoE, RADIUS and IPv4/IPv6
- Experience with observability tools such as Prometheus, Grafana, ELK or Splunk
- Hands-on experience with containerisation and orchestration using Docker and Kubernetes
- Cloud platform experience, ideally Google Cloud Platform, including automation and scaling practices
- Strong Linux administration skills with scripting capability in Bash, Python or similar
- Familiarity with source control and CI/CD pipelines, using tools such as GitHub and GitHub Actions
- Understanding of security frameworks and operational resilience best practices
Desirable
- Experience within ISP, MSP or telecommunications environments
- Familiarity with enterprise IT architectures including OSS and BSS systems
- Knowledge of information security standards and regulations such as ISO 27001, NIST or GDPR
- Experience with infrastructure automation tools such as Terraform or Ansible