Site Reliability Engineer / SRE / Systems Engineer
Job description
A fantastic opportunity for a Site Reliability Engineer / Systems Engineer to support highly available, scalable production systems within a fast-growing technology environment, working across cloud platforms, DevOps, networking and operational resilience.
If you've worked in any of the following roles, we'd also like to hear from you: DevOps Engineer, Operations Engineer, Cloud Engineer, Platform Engineer, Systems Engineer, Infrastructure Engineer, Production Engineer.
As a Site Reliability Engineer / Systems Engineer, you will act as the vital link between operations, end users and backend development teams, ensuring system availability, performance optimisation and effective incident management across live environments.
This Site Reliability Engineer / Systems Engineer role offers the chance to work with modern cloud technologies, containerisation, observability tools and automation practices, while influencing long-term reliability improvements across business-critical systems.
Your duties as the Site Reliability Engineer / Systems Engineer include:
- Incident Triage and Ownership: Acting as first-line technical escalation for live production issues through to resolution or handover
- System Monitoring and Availability: Maintaining high availability, performance and scalability of production platforms and services
- Observability Implementation: Managing logging, monitoring, alerting and metrics to proactively identify and resolve issues
- Reliability Improvements: Collaborating with development teams to translate operational insights into long-term platform resilience
- Automation and Resilience: Supporting automation, incident response and continuous improvement practices
- New Service Support: Ensuring new products and features are operable, reliable and scalable from day one
- Cross-Team Collaboration: Working with network engineering, operations and support teams to diagnose service issues
- Documentation and Reporting: Creating and maintaining runbooks, escalation guides and incident reports
- Incident Prioritisation: Balancing customer impact with long-term system health and stability
- Security and Compliance: Supporting compliance with security, availability and regulatory frameworks
Requirements
- Previous experience in a Site Reliability Engineer, DevOps Engineer, Systems Engineer or Operations Engineer role
- Experience supporting production services at scale within a DevOps or SRE environment
- Strong working knowledge of ISP-related networking concepts including DNS, DHCP, PPPoE, RADIUS and IPv4/IPv6
- Experience with observability tools such as Prometheus, Grafana, ELK or Splunk
- Hands-on experience with containerisation and orchestration using Docker and Kubernetes
- Cloud platform experience, ideally Google Cloud Platform, including automation and scaling practices
- Strong Linux administration skills with scripting capability in Bash, Python or similar
- Familiarity with CI/CD pipelines and source control tools such as GitHub Actions
- Understanding of security frameworks and operational resilience best practices
Desirable
- Experience within ISP, MSP or telecommunications environments
- Familiarity with enterprise IT architectures including OSS and BSS systems
- Knowledge of information security frameworks such as ISO27001, NIST or GDPR
- Experience with infrastructure automation tools such as Terraform or Ansible