Site Reliability Engineer
RedTech Recruitment
Cambridge, United Kingdom
2 days ago
Role details
Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Experience level: Intermediate
Job location: Cambridge, United Kingdom
Tech stack
Agile Methodologies
Amazon Web Services (AWS)
Azure
Bash
Cloud Computing
Cloud Computing Security
Continuous Integration
DevOps
Monitoring of Systems
Integrated Development Environments
Python
PostgreSQL
Linux System Administration
Nginx
Reliability Engineering
Prometheus
TypeScript
CircleCI
Scripting (Bash/Python/Go/Ruby)
Okta
React
Flask
Grafana
Git
Kubernetes
Infrastructure Automation Frameworks
Information Technology
Terraform
Docker
Job description
- Develop and enhance monitoring systems to proactively identify performance, reliability, security, and cost issues
- Monitor platform performance and communicate insights to engineering teams
- Support incident response and assist with remediation of platform vulnerabilities
- Identify, plan, and implement improvements to cloud infrastructure and deployment processes
- Work closely with engineering teams to support product development and platform scalability
- Ensure infrastructure and deployments are secure, robust, and aligned with best practices
- Advocate for effective monitoring and reliability considerations throughout the development lifecycle
- Support ongoing compliance with information security standards including ISO 27001
Requirements
- Minimum 2:1 degree in Computer Science or a related field
- 2+ years' experience in a DevOps, SRE, Platform Engineering or similar role
- Experience configuring and using monitoring tools such as Grafana and Prometheus
- Hands-on experience with cloud infrastructure, ideally GCP (Azure or AWS also considered)
- Experience with Infrastructure-as-Code tools such as Terraform
- Experience working with Docker, Kubernetes, and Helm
- Strong understanding of cloud security and reliability best practices
- Scripting experience using Python and/or Bash
- Experience using Git within a professional software development environment
- Strong problem-solving and analytical skills with a proactive mindset
Desirable:
- Experience responding to and investigating security or reliability incidents in distributed cloud environments
- Ability to communicate technical challenges to non-technical stakeholders
- Familiarity with technologies such as NGINX, Flask (Python), React (TypeScript), PostgreSQL, OpenSearch, Valkey, or Keycloak
- Experience administering Linux-based systems
- Experience with CI tools such as CircleCI
- Exposure to information security compliance standards (e.g. ISO 27001)
- Experience working within Agile development environments
Benefits & conditions
Salary: Negotiable
- A hands-on SRE role with exposure to modern cloud-native technologies and infrastructure
- The opportunity to work on complex, real-world problems within industrial R&D environments
- A collaborative, high-calibre engineering team within a growing Cambridge-based business
- A competitive salary and benefits package
About the company
An exciting opportunity for a Site Reliability Engineer to join an award-winning, Cambridge-based AI software company at the forefront of machine learning innovation.