Platform Engineer - Cloud & Infrastructure Automation and Observation

protein works.

Liverpool, United Kingdom

4 months ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Liverpool, United Kingdom

Tech stack

Microsoft Windows

Microsoft Active Directory

Domain Controllers

API

Artificial Intelligence

Amazon Web Services (AWS)

Jira

Azure

Bash

Ubuntu (Operating System)

CentOS

Cloud Computing

Cloud Computing Security

Software Quality

Code Review

Databases

Continuous Integration

Database Schema

Linux

Network Address Translation

DevOps

Disaster Recovery

DNS

Elasticsearch

File Server

GitHub

Hyper-V

Identity and Access Management

Image Management

Issue Tracking Systems

Networking Hardware

Subnetting

Virtual Private Networks (VPN)

Python

Network Security

PostgreSQL

Linux System Administration

Linux Servers

Load Testing

Log Analysis

MariaDB

Microsoft SQL Server

Windows Server

MySQL

Network Segmentation

PCI Data Security Standards

Peering

PowerShell

Red Hat Enterprise Linux - RHEL

Redis

Reliability Engineering

Ansible

Prometheus

Server Administration

Session Management

System Programming

Virtual Machines

Software Vulnerability Management

Zabbix

Datadog

Data Logging

Pulumi

Scripting (Bash/Python/Go/Ruby)

Google Cloud Platform

Load Balancing

Grafana

Software Security

Boto3

Caching

Firewalls (Computer Science)

Git

CloudFormation

Containerization

GitLab CI

Kubernetes

Information Technology

Cloudflare

Route 53

Functional Programming

CloudWatch

Terraform

New Relic (SaaS)

Software Version Control

DevSecOps

Docker

Jenkins

Vulnerability Analysis

Job description

We are seeking a Platform Engineer to join our technology team and play a central role in managing and automating our hybrid cloud and on-premises infrastructure. Working closely with the Technology Director, Development and IT & Systems teams, you will help drive automation, reliability and operational excellence across the full technology estate.

Our infrastructure operates across a hybrid model spanning multiple cloud providers and on-premises environments, supporting a fast-growing, high-volume e-commerce operation. You will champion Infrastructure as Code, build robust CI/CD and deployment pipelines, establish comprehensive observability, and drive the cultural shift towards modern DevOps practices across the engineering organisation.

Key Responsibilities

Infrastructure Automation & Management:

Infrastructure as Code: Define, provision and manage cloud and on-premises infrastructure using IaC tools (CloudFormation, Terraform, Ansible or similar), eliminating manual configuration and ensuring repeatable, version-controlled environments

Hybrid Cloud Management: Manage and optimise infrastructure across multiple cloud providers and on-premises environments, ensuring consistent governance, security and cost efficiency across the entire estate

On-Premises & Local Infrastructure: Work alongside the IT & Systems team to manage local server infrastructure including Windows Server environments (Domain Controllers, Hyper-V, application and file servers), Linux systems and network security appliances; use IaC tools such as Terraform and Packer to automate the provisioning of local virtual machines and container clusters, ensuring local environments match production standards

Infrastructure Lifecycle Management: Oversee server maintenance, security patching, storage provisioning and networking equipment management across both cloud and local infrastructure, ensuring consistent standards regardless of where workloads run

CI/CD, Deployments & Release Engineering:

CI/CD Pipeline Development: Design, build and maintain continuous integration and deployment pipelines using GitHub Actions, Cloud Build and related tooling, enabling rapid, reliable releases across all environments

Controlled Rollouts & Deployment Strategies: Implement blue-green deployments, canary releases and rolling updates for application, database and infrastructure changes, minimising disruption and enabling safe rollback

Database Deployments: Manage and automate database schema migrations and deployments, ensuring zero-downtime releases through controlled rollout strategies

Runtime Mitigation: Utilise tooling to patch or isolate vulnerable containers in production without interrupting service, enabling rapid response to security findings

Build Reliability: Monitor pipeline health and implement automated alerting for build failures, ensuring the team addresses delivery blockers immediately

Observability, Monitoring & Alerting:

Full-Stack Observability: Architect and maintain a comprehensive observability strategy across all systems, consolidating and extending existing monitoring infrastructure (Zabbix, CloudWatch) with modern tooling such as Grafana, Loki, New Relic or Datadog to ensure proactive alerting and full visibility

Automated Incident Management: Set up integrations between monitoring tools and Jira Service Management to automatically generate incident tickets when production systems fail or breach performance thresholds, and automate ticket triage, prioritisation and escalation workflows

Pipeline & Build Alerting: Configure automation to raise Jira tasks or bugs when critical deployment pipelines fail, ensuring delivery blockers are tracked and resolved promptly

Visibility & Reporting: Build dashboards and automated reporting for incident tracking, post-mortem outcomes and system health, providing transparency to engineering leadership

Security & Vulnerability Management:

Cloud Security Posture: Maintain and enhance security tooling including GuardDuty, Security Hub, Macie and Inspector; manage secrets, IAM policies and network segmentation to ensure compliance with PCI-DSS and data protection requirements

DevSecOps Integration: Integrate application security scanning tools such as Snyk into CI/CD pipelines, shifting security left and embedding vulnerability detection into the development workflow

Reliability, Cost & Performance:

Reliability Engineering: Implement SLIs, SLOs and error budgets; design and conduct game days and disaster recovery exercises; lead incident response and blameless post-mortems to continuously improve system resilience

Capacity Planning: Proactively manage capacity across non-autoscaling and autoscaling architectures, ensuring readiness for peak trading events (Black Friday, Cyber Monday, seasonal promotions) through load testing and performance benchmarking

Cost Management: Monitor and optimise spend across cloud providers and local infrastructure, implementing tagging strategies, right-sizing recommendations and reserved/spot instance policies; work with IT & Systems to manage hardware lifecycle costs, storage provisioning and networking equipment budgets

AI-Driven Operations & Proactive Optimisation:

AIOps & Intelligent Monitoring: Leverage AI-driven tools for anomaly detection, predictive alerting and proactive system optimisation, reducing mean time to detection and resolution

AI-Enhanced CI/CD: Explore and implement AI-assisted pipeline optimisation, intelligent test selection and automated code quality analysis to accelerate delivery

Resource & Cost Optimisation: Use AI-powered recommendations for infrastructure right-sizing, workload scheduling and cost forecasting across the hybrid estate

Compliance & Data Hygiene: Apply AI tooling to automate compliance checks, configuration drift detection and data hygiene across environments

Collaboration & Documentation:

Cross-Team Collaboration: Liaise closely with Development, IT & Systems and Data teams to ensure system uptime, support deployment workflows, unblock developer productivity and align infrastructure decisions with business objectives

Documentation & Knowledge Sharing: Maintain comprehensive runbooks, architecture documentation and disaster recovery plans; champion DevOps best practices and mentor team members on infrastructure tooling and processes

Requirements

5+ years' commercial experience in a DevOps, Site Reliability or Infrastructure Engineering role within a hybrid cloud and on-premises environment

Extensive hands-on experience with AWS services including EC2, RDS, ECS, ElastiCache, S3, CloudFront, Route 53, Lambda, IAM, VPC networking and WAF

Strong understanding of cloud billing models, cost allocation and optimisation strategies

Proficiency with AWS CloudFormation for infrastructure provisioning; experience with Terraform or Pulumi is a plus

Experience with container orchestration using ECS and/or Kubernetes

Extensive experience with Docker, including containerisation, image management and multi-stage builds

Experience with Cloudflare or similar CDN, edge security and DNS management services

Proficiency in Linux administration (Ubuntu, CentOS/RHEL) with solid understanding of networking fundamentals: DNS, load balancing, VPNs, firewalls, subnets, NAT and VPC peering

Languages & Scripting (Required):

Python: Essential. Used extensively for writing custom security and automation tooling, interacting with cloud APIs (Boto3), log analysis and DevSecOps workflows

Bash/Shell: Essential. Required for writing Docker container entrypoint scripts, automating Linux server tasks and managing CI/CD runner environments

CI/CD & Deployment:

Experience building and maintaining CI/CD pipelines with GitHub Actions, Jenkins, GitLab CI or similar

Proven experience implementing blue-green deployments, canary releases and controlled rollout strategies for application, database and infrastructure changes

Version control best practices with Git, including branching strategies and code review workflows

Security & Compliance:

Experience with cloud security tooling including GuardDuty, Security Hub, Macie and Inspector

Familiarity with application security scanning tools such as Snyk or equivalent

Working knowledge of IAM best practices, secrets management, network segmentation and encryption at rest and in transit

Awareness of PCI-DSS requirements in an e-commerce context

Observability & Reliability:

Proven experience implementing monitoring, logging and alerting using tools such as New Relic, Grafana, Loki, CloudWatch, Prometheus or Datadog

Understanding of SRE principles: SLIs, SLOs, error budgets, incident management and blameless post-mortems

Experience with log aggregation and analysis (ELK/OpenSearch, CloudWatch Logs)

Experience integrating monitoring and CI/CD tooling with service management platforms (e.g. Jira Service Management) for automated ticket creation and incident workflows

Desirable Skills

Experience with system programming languages such as Golang

Experience with Microsoft Azure, ideally including Virtual Machines, SQL Server and Active Directory

Experience with Google Cloud Platform, ideally including Cloud Functions, Cloud Scheduler, Pub/Sub and Cloud Build

Familiarity with Windows Server administration (Domain Controllers, Hyper-V, Group Policy); candidates without this experience should be comfortable learning it as part of the role

Familiarity with PowerShell for Windows automation

Experience with database administration across MySQL, MariaDB, PostgreSQL and SQL Server, including replication and failover strategies

Exposure to Elasticsearch/OpenSearch cluster management

Experience managing Redis/ElastiCache clusters for caching and session management

Experience with AI-driven operations tooling (AIOps, ChatOps, intelligent monitoring)

Experience with high-volume e-commerce environments and peak traffic management

Mathematics or Computer Science degree (or equivalent experience)
