Senior Site Reliability Engineer / Technical Architect

ERP Limited
Winnersh Civil Parish, United Kingdom
2 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior
Compensation
£45K

Job location

Winnersh Civil Parish, United Kingdom

Tech stack

Microsoft Active Directory
Artificial Intelligence
Amazon Web Services (AWS)
Azure
Intelligent Platform Management Interface
Bash
Cloud Computing
Cloud Engineering
Cloud Storage
Continuous Integration
Cursor
Dynamic Host Configuration Protocol
Linux
DevOps
DNS
GitHub
Monitoring of Systems
Identity and Access Management
Virtual Private Networks (VPN)
Python
Lightweight Directory Access Protocol (LDAP)
Windows Server
Nagios
Routing
Octopus Deploy
Public Key Infrastructure
Role-Based Access Control
Red Hat Enterprise Linux - RHEL
Reliability Engineering
Software Tools
Ansible
Prometheus
Virtual Machines
Datadog
Load Balancing
High Performance Computing
Okta
Cloud Monitoring
GitHub Copilot
System Availability
Grafana
Firewalls (Computer Science)
PySpark
GitLab CI
Git Flow
Kubernetes
Infrastructure Automation Frameworks
Bare Metal
Slurm
Route 53
Terraform
Splunk
Docker
Jenkins

Job description

Design, build, and maintain scalable cloud infrastructure across AWS and Azure.

Manage Kubernetes platforms including EKS, AKS, Helm, Argo CD, and GitOps workflows.

Create reusable Terraform, Ansible, and automation patterns for infrastructure provisioning.

Define and improve SLOs, SLIs, monitoring, alerting, dashboards, and incident response processes.

Implement observability using tools such as Datadog, Grafana, Prometheus, Loki, Tempo, OpenTelemetry, Splunk, and related platforms.

Improve platform reliability, reduce operational toil, and support root cause analysis during incidents.

Support secure infrastructure access using IAM, Okta, Teleport, RBAC, MFA, TLS/PKI, Secrets Manager, and cloud security controls.

Work with CI/CD tools such as Jenkins, GitLab CI, GitHub Actions, and Argo CD to improve deployment reliability.

Support Linux, Windows Server, Active Directory, DNS, DHCP, LDAP, and Group Policy environments.

Manage large-scale GPU/HPC workloads using SLURM, PySpark, anomaly detection pipelines, and bare-metal provisioning with IPMI and PXE boot.

Apply AI-assisted engineering tools such as Cursor, Claude Code, GitHub Copilot, AWS Bedrock, Ollama, Datadog Watchdog, and Grafana AI Agents to improve automation, troubleshooting, and delivery.

Partner with engineering, security, and business teams to turn operational and regulatory requirements into practical platform standards.

Requirements

We are looking for a highly experienced Senior Site Reliability Engineer / Technical Architect with strong hands-on expertise in cloud infrastructure, Kubernetes, platform engineering, automation, observability, and AI-assisted engineering.

The ideal candidate will have deep experience designing, building, and operating reliable, scalable, and secure infrastructure across AWS, Azure, Kubernetes, Terraform, CI/CD, GitOps, and monitoring platforms. This role requires strong ownership of production systems, incident management, automation, infrastructure standards, and collaboration with engineering, security, and platform teams.

Strong experience in Site Reliability Engineering, DevOps, Cloud Infrastructure, or Platform Engineering.

Hands-on experience with AWS services such as EC2, EKS, ECS, Lambda, RDS, S3, VPC, CloudFront, Route 53, IAM, KMS, WAF, and Secrets Manager.

Experience with Azure services including AKS, Virtual Machines, Virtual Networks, Storage Accounts, Load Balancer, Azure Monitor, and Entra ID.

Strong Kubernetes, Docker, Helm, Terraform, Ansible, and GitOps experience.

Good scripting and automation skills using Python, Bash, or similar languages.

Strong monitoring and observability experience with Datadog, Grafana, Prometheus, Loki, Tempo, OpenTelemetry, Splunk, or Nagios.

Experience with incident response, production support, root cause analysis, capacity planning, cost optimisation, and reliability improvement.

Good understanding of networking, DNS, DHCP, LDAP, load balancers, firewalls, CDN, VPN, and security controls.

Experience working in regulated, high-availability, or large-scale production environments.

Preferred Certifications

Certified Kubernetes Administrator

AWS Certified Solutions Architect

Red Hat Certified Engineer

Microsoft Certified Solutions Expert

CCNA Routing and Switching / Security

This role is suitable for a senior engineer or architect with 15+ years of experience across SRE, cloud, DevOps, infrastructure, and platform engineering. The candidate should be comfortable working across both hands-on technical delivery and architecture-level decision-making, with a strong focus on reliability, automation, security, and developer productivity.
