Site Reliability Engineer

RedTech Recruitment

Cambridge, United Kingdom

2 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Intermediate

Job location

Cambridge, United Kingdom

Tech stack

Agile Methodologies

Amazon Web Services (AWS)

Azure

Bash

Cloud Computing

Cloud Computing Security

Continuous Integration

DevOps

Monitoring of Systems

Integrated Development Environments

Python

PostgreSQL

Linux System Administration

Nginx

Reliability Engineering

Prometheus

TypeScript

CircleCI

Scripting (Bash/Python/Go/Ruby)

Okta

React

Flask

Grafana

Git

Kubernetes

Infrastructure Automation Frameworks

Information Technology

Terraform

Docker

Job description

Develop and enhance monitoring systems to proactively identify performance, reliability, security, and cost issues
Monitor platform performance and communicate insights to engineering teams
Support incident response and assist with remediation of platform vulnerabilities
Identify, plan, and implement improvements to cloud infrastructure and deployment processes
Work closely with engineering teams to support product development and platform scalability
Ensure infrastructure and deployments are secure, robust, and aligned with best practices
Advocate for effective monitoring and reliability considerations throughout the development lifecycle
Support ongoing compliance with information security standards including ISO 27001

Requirements

Minimum 2:1 degree in Computer Science or a related field
2+ years' experience in a DevOps, SRE, Platform Engineering or similar role
Experience configuring and using monitoring tools such as Grafana and Prometheus
Hands-on experience with cloud infrastructure, ideally GCP (Azure or AWS also considered)
Experience with Infrastructure-as-Code tools such as Terraform
Experience working with Docker, Kubernetes, and Helm
Strong understanding of cloud security and reliability best practices
Scripting experience using Python and/or Bash
Experience using Git within a professional software development environment
Strong problem-solving and analytical skills with a proactive mindset

Desirable:

Experience responding to and investigating security or reliability incidents in distributed cloud environments
Ability to communicate technical challenges to non-technical stakeholders
Familiarity with technologies such as NGINX, Flask (Python), React (TypeScript), PostgreSQL, OpenSearch, Valkey, or Keycloak
Experience administering Linux-based systems
Experience with CI tools such as CircleCI
Exposure to information security compliance standards (e.g. ISO 27001)
Experience working within Agile development environments

Benefits & conditions

Salary: Negotiable

A hands-on SRE role with exposure to modern cloud-native technologies and infrastructure
The opportunity to work on complex, real-world problems within industrial R&D environments
A collaborative, high-calibre engineering team within a growing Cambridge-based business
A competitive salary and benefits package

About the company

An exciting opportunity for a Site Reliability Engineer to join an award-winning, Cambridge-based AI software company at the forefront of machine learning innovation.