SRE Engineer
Job description
We're now on the lookout for an SRE Engineer. You'll be joining a global, diverse team working with cross-functional stakeholders. This is a permanent, full-time opportunity based in London.
The type of person suitable for this role
- Managing and optimising our infrastructure to ensure high availability and system reliability.
- Delivering 24/7 support via an on-call rotation for out-of-hours issues.
- Infrastructure Automation Expertise:
- Experience with the AWS cloud platform, including designing, deploying, and maintaining scalable infrastructure.
Requirements
Do you have experience in Terraform?
- Ability to work on multiple tasks in parallel
- Problem solver
- Excellent communicator
- Desire to improve things
What skills will you need?
- Kubernetes
o Kubernetes and application troubleshooting
o Application deployment GitOps / ArgoCD
o K8s and application logging (Loki / Fluent Bit)
o Service Mesh (Linkerd preferred)
o Ingress Config / Troubleshooting (AWS LB Controller / Nginx)
o Autoscaling configuration (Karpenter)
o Certificate management (cert-manager)
- AWS services
o EKS
o RDS, DMS, RDS Proxy
o AWS Backup
o API Gateway
o RabbitMQ
o AWS Transfer Family (SFTP / SFTP Connector)
o AWS NGFW, TGW, PrivateLink
o AppStream
o Lambda - Python
o IAM
o Kinesis
o DynamoDB
- Terragrunt / Terraform
o Troubleshooting defects
- GitOps
o Helm / ArgoCD
- Observability Tooling
o Grafana, Prometheus, Loki, CloudWatch configuration/dashboard creation
- CI/CD
- Strong knowledge of containerisation and orchestration tools such as Docker and Kubernetes.
- Familiarity with deploying Infrastructure as Code (IaC) with Terraform and CloudFormation.
- Chaos Engineering Proficiency:
- Understanding of how to implement resilience testing strategies.
- Using chaos engineering tools such as AWS Fault Injection Service, Gremlin, Chaos Monkey, or LitmusChaos to design and execute fault injection experiments.
- Knowledge of modern chaos engineering trends, such as adaptive resilience testing or AI-driven fault detection.
- Monitoring and Observability:
- Experience with monitoring and observability tools (e.g., Prometheus, ADOT, Grafana, Datadog, New Relic, Elastic Stack).
- Strong understanding of instrumenting infrastructure with metrics, logging, and tracing.
- Automation and Scripting:
- Proficiency in scripting and automation languages (e.g., Python, Go, Shell, Ruby, or Java).
- Demonstrated ability to automate infrastructure and operational processes.
- Incident Management and Root Cause Analysis:
- Participating in incident response processes, including triage, mitigation, and communication.
- Familiarity with incident management tools like PagerDuty or Opsgenie.
- Responding to production incidents, troubleshooting issues across the full stack, and ensuring minimal downtime by driving root cause analysis and applying long-term fixes.
- Conducting blameless post-mortems to identify root causes and derive actionable insights, ensuring continuous improvement.
- Developing playbooks for common incidents, reducing Mean Time to Resolution (MTTR).
- Resilience and Scalability Design:
- Understanding of system design principles, scalability, and high-availability architectures.
- Practical experience with load testing and performance benchmarking tools (e.g., JMeter, Locust, k6).
- Designing and testing disaster recovery (DR) strategies to ensure minimal downtime and data
Benefits & conditions
- Instant savings and discounts on major retailers across the country
- Private Health Insurance including Dental and Optical Cover
- Non-contributory Pension Scheme
- Salary Sacrifice Schemes - Car, Cycle to Work and Additional Pension Contributions
- Additional GBST & U day off every year
- Employee Assistance Program (EAP)
- LinkedIn Learning