SRE (Site Reliability Engineer)

Go Arrow

3 days ago

Role details

Contract type

Temporary contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Compensation

£ 122K

Job location

Tech stack

API

Amazon Web Services (AWS)

Server Applications

Azure

Bash

Oracle WebLogic Server

Ubuntu (Operating System)

CentOS

Cloud Computing

Configuration Management

Computer Networks

Databases

Continuous Integration

Linux

DevOps

Disaster Recovery

Distributed Systems

DNS

Elasticsearch

Perl

GitHub

Groovy

Web Servers

IBM WebSphere Application Server

WildFly (JBoss AS)

Python

Shell

Microsoft SQL Server

Team Foundation Server

Windows Server

MySQL

Nagios

Nginx

OpenStack

Oracle

PowerShell

Systems Development Life Cycle

Ruby on Rails

Release Management

Reliability Engineering

Ansible

Ruby

Software Deployment

Software Engineering

Transmission Control Protocol (TCP)

T-SQL

Virtual Machines

Web Services

Scripting (Bash/Python/Go/Ruby)

Google Cloud Platform

SQL Optimization

Reliability of Systems

Firewalls (Computer Science)

GitLab

GitLab CI

Kubernetes

Puppet

REST

Terraform

Splunk

New Relic (SaaS)

Software Version Control

Docker

Jenkins

VMware

Microservices

Job description

We are seeking a highly skilled Site Reliability Engineer to join our dynamic IT team. The successful candidate will be responsible for ensuring the stability, scalability, and performance of our cloud-based and on-premise systems. This role involves developing automation solutions, managing infrastructure, and supporting software deployment processes across diverse environments. The ideal applicant will possess a strong background in system administration, cloud computing, and software development, with a keen eye for troubleshooting and incident management. This is an excellent opportunity for professionals passionate about maintaining high-availability systems and driving continuous improvement in system reliability.

Design, implement, and maintain scalable and reliable infrastructure using tools such as Kubernetes, Terraform, Ansible, Puppet, Chef, and VMware.
Monitor system performance with tools like New Relic, Splunk, Elasticsearch, and Nagios to proactively identify issues before they impact users.
Automate deployment pipelines leveraging Jenkins, GitLab CI/CD, TFS, and other continuous integration tools to streamline software releases.
Manage cloud environments including AWS, Azure, Google Cloud Platform (GCP), and OpenStack to optimise resource utilisation and cost-efficiency.
Develop scripts using PowerShell, Bash (Unix shell), Python, Ruby, Perl, Groovy, or Go to automate routine tasks and improve operational efficiency.
Troubleshoot complex issues involving web services such as REST APIs, web servers such as NGINX, and application servers including WebSphere, WebLogic, or JBoss.
Implement disaster recovery plans and perform incident response activities to minimise downtime during outages or security breaches.
Collaborate with development teams on requirements gathering for new features or system upgrades following SDLC best practices.
Maintain comprehensive documentation of system configurations and procedures aligned with ITIL standards for release management and change control.

Requirements

Do you have experience in WebLogic?

Proven experience in a Site Reliability Engineering or DevOps role within a large-scale enterprise environment.
Extensive knowledge of containerisation technologies such as Docker and Kubernetes.
Hands-on experience with cloud platforms including AWS (Amazon S3, EC2), Azure (Virtual Machines), Google Cloud Platform (GCP), or OpenStack.
Strong proficiency in scripting languages such as Python, PowerShell, Bash (Unix shell), Ruby, or Groovy for automation tasks.
Familiarity with configuration management tools such as Ansible, Puppet, and Chef; version control systems including GitHub or GitLab; and CI/CD pipelines using Jenkins or TFS.
Experience managing distributed systems architecture involving microservices and APIs over TCP/IP networks.
Knowledge of databases including MySQL, Microsoft SQL Server (T-SQL), and Oracle, along with experience in SQL optimisation and disaster recovery planning.
Understanding of computer networking concepts such as DNS, TCP/IP protocols, firewalls, and LAN/WAN configurations.
Ability to troubleshoot software issues across various platforms including Linux (CentOS/Ubuntu) and Windows Server environments.

This role offers an engaging environment where technical expertise is valued and professional growth is encouraged through exposure to cutting-edge technology stacks and best practices in system reliability engineering.