Site Reliability Engineer

RedTech Recruitment
Cambridge, United Kingdom

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Intermediate

Job location

Cambridge, United Kingdom

Tech stack

Agile Methodologies
Amazon Web Services (AWS)
Azure
Bash
Cloud Computing
Cloud Computing Security
Continuous Integration
DevOps
Monitoring of Systems
Integrated Development Environments
Python
PostgreSQL
Linux System Administration
Nginx
Reliability Engineering
Prometheus
TypeScript
CircleCI
Scripting (Bash/Python/Go/Ruby)
Okta
React
Flask
Grafana
Git
Kubernetes
Infrastructure Automation Frameworks
Information Technology
Terraform
Docker

Job description

  • Develop and enhance monitoring systems to proactively identify performance, reliability, security, and cost issues
  • Monitor platform performance and communicate insights to engineering teams
  • Support incident response and assist with remediation of platform vulnerabilities
  • Identify, plan, and implement improvements to cloud infrastructure and deployment processes
  • Work closely with engineering teams to support product development and platform scalability
  • Ensure infrastructure and deployments are secure, robust, and aligned with best practices
  • Advocate for effective monitoring and reliability considerations throughout the development lifecycle
  • Support ongoing compliance with information security standards including ISO 27001

Requirements

  • Minimum 2:1 degree in Computer Science or a related field
  • 2+ years' experience in a DevOps, SRE, Platform Engineering or similar role
  • Experience configuring and using monitoring tools such as Grafana and Prometheus
  • Hands-on experience with cloud infrastructure, ideally GCP (Azure or AWS also considered)
  • Experience with Infrastructure-as-Code tools such as Terraform
  • Experience working with Docker, Kubernetes, and Helm
  • Strong understanding of cloud security and reliability best practices
  • Scripting experience using Python and/or Bash
  • Experience using Git within a professional software development environment
  • Strong problem-solving and analytical skills with a proactive mindset

Desirable:

  • Experience responding to and investigating security or reliability incidents in distributed cloud environments
  • Ability to communicate technical challenges to non-technical stakeholders
  • Familiarity with technologies such as NGINX, Flask (Python), React (TypeScript), PostgreSQL, OpenSearch, Valkey, or Keycloak
  • Experience administering Linux-based systems
  • Experience with CI tools such as CircleCI
  • Exposure to information security compliance standards (e.g. ISO 27001)
  • Experience working within Agile development environments

Benefits & conditions

Salary: Negotiable

  • A hands-on SRE role with exposure to modern cloud-native technologies and infrastructure
  • The opportunity to work on complex, real-world problems within industrial R&D environments
  • A collaborative, high-calibre engineering team within a growing Cambridge-based business
  • A competitive salary and benefits package

About the company

An exciting opportunity for a Site Reliability Engineer to join an award-winning, Cambridge-based AI software company at the forefront of machine learning innovation.
