Site Reliability Engineer (SRE)
Charles Simon Associates Ltd
4 days ago
Role details
Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Compensation: £125K
Job location: Remote
Tech stack
Azure
Bash
DevOps
Distributed Systems
Identity and Access Management
Python
Load Testing
Log Analysis
PowerShell
Reliability Engineering
Prometheus
Systems Integration
Web Applications
Datadog
Pulumi
Scripting (Bash/Python/Go/Ruby)
Grafana
CloudFormation
Kubernetes
Puppet
Terraform
Vulnerability Analysis
Microservices
Job description
Site Reliability Engineer - (SRE, Site Reliability Engineer, Terraform, AKS, Azure, Kubernetes, PowerShell, Python, Bash, Datadog, Monitoring Tools) - Permanent - Remote
- Designing and enforcing service-level objectives (SLOs), SLIs, and SLAs to ensure reliability targets are measurable and aligned with business expectations
- Implementing incident response frameworks, including runbooks, postmortems, and blameless RCA processes to drive continuous improvement
- Integrating observability tooling (e.g. Prometheus, Grafana, Datadog, OpenTelemetry) to enable proactive detection and resolution of system anomalies
- Managing infrastructure as code (IaC) using tools like Terraform, Pulumi, or CloudFormation to ensure repeatable, auditable deployments
- Optimizing cost and resource utilization across cloud environments through rightsizing, autoscaling, and lifecycle policies
- Driving chaos engineering initiatives to test system resilience under failure conditions and validate recovery strategies
- Championing security best practices within infrastructure, e.g. secrets management, IAM policies, and vulnerability scanning
- Collaborating with DevOps and platform teams to build paved-road deployment patterns and internal developer portals
- Leading capacity planning and load testing efforts to anticipate scaling needs and prevent bottlenecks
- Contributing to architectural decisions that impact reliability, latency, and fault domains across distributed systems
Requirements
- Extensive SRE experience in previous roles
- Strong Terraform skills
- Proven Kubernetes and AKS experience
- Experience creating and modifying Terraform deployments in live environments
- Experience with monitoring solutions, ideally Datadog, though Azure Application Insights, Log Analytics, or Grafana are also relevant
- Scripting skills for automation in PowerShell, Python, or Bash
- Experience with web-based applications
Desirable Skills:
- Knowledge or commercial experience of Microservices Architecture
- Kanban
- Prior experience working with Puppet and Chef would be advantageous