SRE Engineer

GBST
Charing Cross, United Kingdom
4 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English

Job location

Charing Cross, United Kingdom

Tech stack

Java
Artificial Intelligence
Amazon Web Services (AWS)
Continuous Integration
Disaster Recovery
Amazon DynamoDB
Elasticsearch
Monitoring of Systems
Identity and Access Management
JMeter
Python
Load Testing
Nginx
RabbitMQ
Reliability Engineering
Prometheus
Ruby
Strategies of Testing
Datadog
SSL Certificate Management
Data Logging
File Transfer Protocol (FTP)
Autoscaling
Istio
System Availability
Grafana
MTTR
Reliability of Systems
CloudFormation
Kubernetes
Infrastructure Automation Frameworks
Linkerd (Service Mesh)
CloudWatch
API Gateway
Terraform
New Relic (SaaS)
Docker
PagerDuty

Job description

We're now on the lookout for an SRE Engineer. You'll be joining a global, diverse team working with cross-functional stakeholders. This is a permanent, full-time opportunity based in London.

The type of person suitable for this role

  • Managing and optimising our infrastructure to ensure high availability and system reliability.
  • Delivering 24/7 support via an on-call rotation for after-hours issues.
  • Infrastructure Automation Expertise:
  • Experience with the AWS cloud platform, including designing, deploying, and maintaining scalable infrastructure.

Requirements

Do you have experience in Terraform?

  • Ability to work on multiple tasks in parallel
  • Problem solver
  • Excellent communicator
  • Desire to improve things

What skills will you need?

  • Kubernetes
      o Kubernetes and application troubleshooting
      o Application deployment GitOps / ArgoCD
      o K8s and application logging (Loki / Fluent Bit)
      o Service Mesh (Linkerd preferred)
      o Ingress Config / Troubleshooting (AWS LB Controller / Nginx)
      o Autoscaling configuration (Karpenter)
      o Certificate management (cert-manager)
  • AWS services
      o EKS
      o RDS, DMS, RDS Proxy
      o AWS Backup
      o API Gateway
      o RabbitMQ
      o AWS Transfer Family (SFTP / SFTP Connector)
      o AWS NGFW, TGW, PrivateLink
      o AppStream
      o Lambda - Python
      o IAM
      o Kinesis
      o DynamoDB
  • Terragrunt / Terraform
      o Troubleshooting defects
  • GitOps
      o Helm / ArgoCD
  • Observability Tooling
      o Grafana, Prometheus, Loki, CloudWatch configuration/dashboard creation
  • CI/CD
  • Strong knowledge of containerisation and orchestration tools like Docker and Kubernetes.
  • Familiarity with deploying Infrastructure as Code (IaC) with Terraform and CloudFormation.
  • Chaos Engineering Proficiency:
  • Understanding of how to implement resilience testing strategies.
  • Implementing chaos engineering tools such as AWS Fault Injection Service, Gremlin, Chaos Monkey, or LitmusChaos to design and execute fault injection experiments.
  • Knowledge of modern chaos engineering trends, such as adaptive resilience testing or AI-driven fault detection.
  • Monitoring and Observability:
  • Experience with monitoring and observability tools (e.g., Prometheus, ADOT, Grafana, Datadog, New Relic, Elastic Stack).

  • Strong understanding of instrumenting infrastructure with metrics, logging, and tracing
  • Automation and Scripting:
  • Proficiency in scripting and automation languages (e.g., Python, Go, Shell, Ruby, or Java).
  • Demonstrated ability to automate infrastructure and operational processes.
  • Incident Management and Root Cause Analysis:
  • Participating in incident response processes, including triage, mitigation, and communication.
  • Familiarity with incident management tools like PagerDuty or Opsgenie.
  • Responding to production incidents, troubleshooting issues across the full stack, and ensuring minimal downtime by driving root cause analysis and applying long-term fixes.
  • Conducting blameless post-mortems to identify root causes and derive actionable insights, ensuring continuous improvement.

  • Developing playbooks for common incidents, reducing Mean Time to Resolution (MTTR)
  • Resilience and Scalability Design:
  • Understanding of system design principles, scalability, and high-availability architectures.
  • Practical experience with load testing and performance benchmarking tools (e.g., JMeter, Locust, k6).

  • Designing and testing disaster recovery (DR) strategies to ensure minimal downtime and data

Benefits & conditions

  • Employee discount

  • Employee assistance programme

  • Company pension

  • Private medical insurance

  • Cycle to work scheme

  • Car scheme

  • Instant savings and discounts at major retailers across the country

  • Private Health Insurance including Dental and Optical Cover

  • Non-contributory Pension Scheme

  • Salary Sacrifice Schemes - Car, Cycle to Work and Additional Pension Contributions

  • Additional GBST & U day off every year

  • Employee Assistance Program (EAP)

  • LinkedIn Learning

About the company

At GBST, we're inspiring wealth innovation for wealth management and advice organisations globally. Our commitment to excellence, track record of continued and successful delivery, hard work and product excellence have earned us the trust and partnership of many of the world's leading financial services organisations.
