Platform Engineer - Cloud & Infrastructure Automation and Observation

protein works.
Liverpool, United Kingdom
2 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Liverpool, United Kingdom

Tech stack

Microsoft Windows
Microsoft Active Directory
Domain Controllers
API
Artificial Intelligence
Amazon Web Services (AWS)
JIRA
Azure
Bash
Ubuntu (Operating System)
CentOS
Cloud Computing
Cloud Computing Security
Software Quality
Code Review
Databases
Continuous Integration
Database Schema
Linux
Network Address Translation
DevOps
Disaster Recovery
DNS
Elasticsearch
File Server
GitHub
Hyper-V
Identity and Access Management
Image Management
Issue Tracking Systems
Networking Hardware
Subnetting
Virtual Private Networks (VPN)
Python
Network Security
PostgreSQL
Linux System Administration
Linux Servers
Load Testing
Log Analysis
MariaDB
Microsoft SQL Server
Windows Server
MySQL
Network Segmentation
PCI Data Security Standards
Peering
PowerShell
Red Hat Enterprise Linux - RHEL
Redis
Reliability Engineering
Ansible
Prometheus
Server Administration
Session Management
System Programming
Virtual Machines
Software Vulnerability Management
Zabbix
Datadog
Data Logging
Pulumi
Scripting (Bash/Python/Go/Ruby)
Google Cloud Platform
Load Balancing
Grafana
Software Security
Boto3
Caching
Firewalls (Computer Science)
Git
CloudFormation
Containerization
GitLab CI
Kubernetes
Information Technology
Cloudflare
Route 53
Functional Programming
CloudWatch
Terraform
New Relic (SaaS)
Software Version Control
DevSecOps
Docker
Jenkins
Vulnerability Analysis
Go

Job description

We are seeking a Platform Engineer to join our technology team and play a central role in managing and automating our hybrid cloud and on-premises infrastructure. Working closely with the Technology Director, Development and IT & Systems teams, you will help drive automation, reliability and operational excellence across the full technology estate.

Our infrastructure operates across a hybrid model spanning multiple cloud providers and on-premises environments, supporting a fast-growing, high-volume e-commerce operation. You will champion Infrastructure as Code, build robust CI/CD and deployment pipelines, establish comprehensive observability, and drive the cultural shift towards modern DevOps practices across the engineering organisation.

Key Responsibilities

Infrastructure Automation & Management:

Infrastructure as Code: Define, provision and manage cloud and on-premises infrastructure using IaC tools (CloudFormation, Terraform, Ansible or similar), eliminating manual configuration and ensuring repeatable, version-controlled environments

Hybrid Cloud Management: Manage and optimise infrastructure across multiple cloud providers and on-premises environments, ensuring consistent governance, security and cost efficiency across the entire estate

On-Premises & Local Infrastructure: Work alongside the IT & Systems team to manage local server infrastructure including Windows Server environments (Domain Controllers, Hyper-V, application and file servers), Linux systems and network security appliances; use IaC tools such as Terraform and Packer to automate the provisioning of local virtual machines and container clusters, ensuring local environments match production standards

Infrastructure Lifecycle Management: Oversee server maintenance, security patching, storage provisioning and networking equipment management across both cloud and local infrastructure, ensuring consistent standards regardless of where workloads run

CI/CD, Deployments & Release Engineering:

CI/CD Pipeline Development: Design, build and maintain continuous integration and deployment pipelines using GitHub Actions, Cloud Build and related tooling, enabling rapid, reliable releases across all environments

Controlled Rollouts & Deployment Strategies: Implement blue-green deployments, canary releases and rolling updates for application, database and infrastructure changes, minimising disruption and enabling safe rollback

Database Deployments: Manage and automate database schema migrations and deployments, ensuring zero-downtime releases through controlled rollout strategies

Runtime Mitigation: Utilise tooling to patch or isolate vulnerable containers in production without interrupting service, enabling rapid response to security findings

Build Reliability: Monitor pipeline health and implement automated alerting for build failures, ensuring the team addresses delivery blockers immediately

Observability, Monitoring & Alerting:

Full-Stack Observability: Architect and maintain a comprehensive observability strategy across all systems, consolidating and extending existing monitoring infrastructure (Zabbix, CloudWatch) with modern tooling such as Grafana, Loki, New Relic or Datadog to ensure proactive alerting and full visibility

Automated Incident Management: Set up integrations between monitoring tools and Jira Service Management to automatically generate incident tickets when production systems fail or breach performance thresholds, and automate ticket triage, prioritisation and escalation workflows

Pipeline & Build Alerting: Configure automation to raise Jira tasks or bugs when critical deployment pipelines fail, ensuring delivery blockers are tracked and resolved promptly

Visibility & Reporting: Build dashboards and automated reporting for incident tracking, post-mortem outcomes and system health, providing transparency to engineering leadership

Security & Vulnerability Management:

Cloud Security Posture: Maintain and enhance security tooling including GuardDuty, Security Hub, Macie and Inspector; manage secrets, IAM policies and network segmentation to ensure compliance with PCI-DSS and data protection requirements

DevSecOps Integration: Integrate application security scanning tools such as Snyk into CI/CD pipelines, shifting security left and embedding vulnerability detection into the development workflow

Reliability, Cost & Performance:

Reliability Engineering: Implement SLIs, SLOs and error budgets; design and conduct game days and disaster recovery exercises; lead incident response and blameless post-mortems to continuously improve system resilience

Capacity Planning: Proactively manage capacity across non-autoscaling and autoscaling architectures, ensuring readiness for peak trading events (Black Friday, Cyber Monday, seasonal promotions) through load testing and performance benchmarking

Cost Management: Monitor and optimise spend across cloud providers and local infrastructure, implementing tagging strategies, right-sizing recommendations and reserved/spot instance policies; work with IT & Systems to manage hardware lifecycle costs, storage provisioning and networking equipment budgets

AI-Driven Operations & Proactive Optimisation:

AIOps & Intelligent Monitoring: Leverage AI-driven tools for anomaly detection, predictive alerting and proactive system optimisation, reducing mean time to detection and resolution

AI-Enhanced CI/CD: Explore and implement AI-assisted pipeline optimisation, intelligent test selection and automated code quality analysis to accelerate delivery

Resource & Cost Optimisation: Use AI-powered recommendations for infrastructure right-sizing, workload scheduling and cost forecasting across the hybrid estate

Compliance & Data Hygiene: Apply AI tooling to automate compliance checks, configuration drift detection and data hygiene across environments

Collaboration & Documentation:

Cross-Team Collaboration: Liaise closely with Development, IT & Systems and Data teams to ensure system uptime, support deployment workflows, unblock developer productivity and align infrastructure decisions with business objectives

Documentation & Knowledge Sharing: Maintain comprehensive runbooks, architecture documentation and disaster recovery plans; champion DevOps best practices and mentor team members on infrastructure tooling and processes

Requirements

5+ years' commercial experience in a DevOps, Site Reliability or Infrastructure Engineering role within a hybrid cloud and on-premises environment

Extensive hands-on experience with AWS services including EC2, RDS, ECS, ElastiCache, S3, CloudFront, Route 53, Lambda, IAM, VPC networking and WAF

Strong understanding of cloud billing models, cost allocation and optimisation strategies

Proficiency with AWS CloudFormation for infrastructure provisioning; experience with Terraform or Pulumi is a plus

Experience with container orchestration using ECS and/or Kubernetes

Extensive experience with Docker, including containerisation, image management and multi-stage builds

Experience with Cloudflare or similar CDN, edge security and DNS management services

Proficiency in Linux administration (Ubuntu, CentOS/RHEL) with solid understanding of networking fundamentals: DNS, load balancing, VPNs, firewalls, subnets, NAT and VPC peering

Languages & Scripting (Required):

Python: Essential. Used extensively for writing custom security and automation tooling, interacting with cloud APIs (Boto3), log analysis and DevSecOps workflows

Bash/Shell: Essential. Required for writing Docker container entrypoint scripts, automating Linux server tasks and managing CI/CD runner environments

CI/CD & Deployment:

Experience building and maintaining CI/CD pipelines with GitHub Actions, Jenkins, GitLab CI or similar

Proven experience implementing blue-green deployments, canary releases and controlled rollout strategies for application, database and infrastructure changes

Version control best practices with Git, including branching strategies and code review workflows

Security & Compliance:

Experience with cloud security tooling including GuardDuty, Security Hub, Macie and Inspector

Familiarity with application security scanning tools such as Snyk or equivalent

Working knowledge of IAM best practices, secrets management, network segmentation and encryption at rest and in transit

Awareness of PCI-DSS requirements in an e-commerce context

Observability & Reliability:

Proven experience implementing monitoring, logging and alerting using tools such as New Relic, Grafana, Loki, CloudWatch, Prometheus or Datadog

Understanding of SRE principles: SLIs, SLOs, error budgets, incident management and blameless post-mortems

Experience with log aggregation and analysis (ELK/OpenSearch, CloudWatch Logs)

Experience integrating monitoring and CI/CD tooling with service management platforms (e.g. Jira Service Management) for automated ticket creation and incident workflows

Desirable Skills

Experience with system programming languages such as Golang

Experience with Microsoft Azure, ideally including Virtual Machines, SQL Server and Active Directory

Experience with Google Cloud Platform, ideally including Cloud Functions, Cloud Scheduler, Pub/Sub and Cloud Build

Familiarity with Windows Server administration (Domain Controllers, Hyper-V, Group Policy); candidates without this experience should be comfortable learning it as part of the role

Familiarity with PowerShell for Windows automation

Experience with database administration across MySQL, MariaDB, PostgreSQL and SQL Server, including replication and failover strategies

Exposure to Elasticsearch/OpenSearch cluster management

Experience managing Redis/ElastiCache clusters for caching and session management

Experience with AI-driven operations tooling (AIOps, ChatOps, intelligent monitoring)

Experience with high-volume e-commerce environments and peak traffic management

Mathematics or Computer Science degree (or equivalent experience)
