Platform Engineer - Cloud & Infrastructure Automation and Observation
Job description
We are seeking a Platform Engineer to join our technology team and play a central role in managing and automating our hybrid cloud and on-premises infrastructure. Working closely with the Technology Director, Development and IT & Systems teams, you will help drive automation, reliability and operational excellence across the full technology estate.
Our infrastructure operates across a hybrid model spanning multiple cloud providers and on-premises environments, supporting a fast-growing, high-volume e-commerce operation. You will champion Infrastructure as Code, build robust CI/CD and deployment pipelines, establish comprehensive observability, and drive the cultural shift towards modern DevOps practices across the engineering organisation.
Key Responsibilities
Infrastructure Automation & Management:
Infrastructure as Code: Define, provision and manage cloud and on-premises infrastructure using IaC tools (CloudFormation, Terraform, Ansible or similar), eliminating manual configuration and ensuring repeatable, version-controlled environments
Hybrid Cloud Management: Manage and optimise infrastructure across multiple cloud providers and on-premises environments, ensuring consistent governance, security and cost efficiency across the entire estate
On-Premises & Local Infrastructure: Work alongside the IT & Systems team to manage local server infrastructure including Windows Server environments (Domain Controllers, Hyper-V, application and file servers), Linux systems and network security appliances; use IaC tools such as Terraform and Packer to automate the provisioning of local virtual machines and container clusters, ensuring local environments match production standards
Infrastructure Lifecycle Management: Oversee server maintenance, security patching, storage provisioning and networking equipment management across both cloud and local infrastructure, ensuring consistent standards regardless of where workloads run
CI/CD, Deployments & Release Engineering:
CI/CD Pipeline Development: Design, build and maintain continuous integration and deployment pipelines using GitHub Actions, Cloud Build and related tooling, enabling rapid, reliable releases across all environments
Controlled Rollouts & Deployment Strategies: Implement blue-green deployments, canary releases and rolling updates for application, database and infrastructure changes, minimising disruption and enabling safe rollback
Database Deployments: Manage and automate database schema migrations and deployments, ensuring zero-downtime releases through controlled rollout strategies
Runtime Mitigation: Utilise tooling to patch or isolate vulnerable containers in production without interrupting service, enabling rapid response to security findings
Build Reliability: Monitor pipeline health and implement automated alerting for build failures, ensuring the team addresses delivery blockers immediately
Observability, Monitoring & Alerting:
Full-Stack Observability: Architect and maintain a comprehensive observability strategy across all systems, consolidating and extending existing monitoring infrastructure (Zabbix, CloudWatch) with modern tooling such as Grafana, Loki, New Relic or Datadog to ensure proactive alerting and full visibility
Automated Incident Management: Set up integrations between monitoring tools and Jira Service Management to automatically generate incident tickets when production systems fail or breach performance thresholds, and automate ticket triage, prioritisation and escalation workflows
Pipeline & Build Alerting: Configure automation to raise Jira tasks or bugs when critical deployment pipelines fail, ensuring delivery blockers are tracked and resolved promptly
Visibility & Reporting: Build dashboards and automated reporting for incident tracking, post-mortem outcomes and system health, providing transparency to engineering leadership
Security & Vulnerability Management:
Cloud Security Posture: Maintain and enhance security tooling including GuardDuty, Security Hub, Macie and Inspector; manage secrets, IAM policies and network segmentation to ensure compliance with PCI-DSS and data protection requirements
DevSecOps Integration: Integrate application security scanning tools such as Snyk into CI/CD pipelines, shifting security left and embedding vulnerability detection into the development workflow
Reliability, Cost & Performance:
Reliability Engineering: Implement SLIs, SLOs and error budgets; design and conduct game days and disaster recovery exercises; lead incident response and blameless post-mortems to continuously improve system resilience
Capacity Planning: Proactively manage capacity across non-autoscaling and autoscaling architectures, ensuring readiness for peak trading events (Black Friday, Cyber Monday, seasonal promotions) through load testing and performance benchmarking
Cost Management: Monitor and optimise spend across cloud providers and local infrastructure, implementing tagging strategies, right-sizing recommendations and reserved/spot instance policies; work with IT & Systems to manage hardware lifecycle costs, storage provisioning and networking equipment budgets
AI-Driven Operations & Proactive Optimisation:
AIOps & Intelligent Monitoring: Leverage AI-driven tools for anomaly detection, predictive alerting and proactive system optimisation, reducing mean time to detection and resolution
AI-Enhanced CI/CD: Explore and implement AI-assisted pipeline optimisation, intelligent test selection and automated code quality analysis to accelerate delivery
Resource & Cost Optimisation: Use AI-powered recommendations for infrastructure right-sizing, workload scheduling and cost forecasting across the hybrid estate
Compliance & Data Hygiene: Apply AI tooling to automate compliance checks, configuration drift detection and data hygiene across environments
Collaboration & Documentation:
Cross-Team Collaboration: Liaise closely with Development, IT & Systems and Data teams to ensure system uptime, support deployment workflows, unblock developer productivity and align infrastructure decisions with business objectives
Documentation & Knowledge Sharing: Maintain comprehensive runbooks, architecture documentation and disaster recovery plans; champion DevOps best practices and mentor team members on infrastructure tooling and processes
Requirements
5+ years' commercial experience in a DevOps, Site Reliability or Infrastructure Engineering role within a hybrid cloud and on-premises environment
Extensive hands-on experience with AWS services including EC2, RDS, ECS, ElastiCache, S3, CloudFront, Route 53, Lambda, IAM, VPC networking and WAF
Strong understanding of cloud billing models, cost allocation and optimisation strategies
Proficiency with AWS CloudFormation for infrastructure provisioning; experience with Terraform or Pulumi is a plus
Experience with container orchestration using ECS and/or Kubernetes
Extensive experience with Docker, including containerisation, image management and multi-stage builds
Experience with Cloudflare or similar CDN, edge security and DNS management services
Proficiency in Linux administration (Ubuntu, CentOS/RHEL) with solid understanding of networking fundamentals: DNS, load balancing, VPNs, firewalls, subnets, NAT and VPC peering
Languages & Scripting (Required):
Python: Essential. Used extensively for writing custom security and automation tooling, interacting with cloud APIs (Boto3), log analysis and DevSecOps workflows
Bash/Shell: Essential. Required for writing Docker container entrypoint scripts, automating Linux server tasks and managing CI/CD runner environments
CI/CD & Deployment:
Experience building and maintaining CI/CD pipelines with GitHub Actions, Jenkins, GitLab CI or similar
Proven experience implementing blue-green deployments, canary releases and controlled rollout strategies for application, database and infrastructure changes
Version control best practices with Git, including branching strategies and code review workflows
Security & Compliance:
Experience with cloud security tooling including GuardDuty, Security Hub, Macie and Inspector
Familiarity with application security scanning tools such as Snyk or equivalent
Working knowledge of IAM best practices, secrets management, network segmentation and encryption at rest and in transit
Awareness of PCI-DSS requirements in an e-commerce context
Observability & Reliability:
Proven experience implementing monitoring, logging and alerting using tools such as New Relic, Grafana, Loki, CloudWatch, Prometheus or Datadog
Understanding of SRE principles: SLIs, SLOs, error budgets, incident management and blameless post-mortems
Experience with log aggregation and analysis (ELK/OpenSearch, CloudWatch Logs)
Experience integrating monitoring and CI/CD tooling with service management platforms (e.g. Jira Service Management) for automated ticket creation and incident workflows
Desirable Skills
Experience with systems programming languages such as Go
Experience with Microsoft Azure, ideally including Virtual Machines, SQL Server and Active Directory
Experience with Google Cloud Platform, ideally including Cloud Functions, Cloud Scheduler, Pub/Sub and Cloud Build
Familiarity with Windows Server administration (Domain Controllers, Hyper-V, Group Policy); candidates without this experience should be comfortable learning it as part of the role
Familiarity with PowerShell for Windows automation
Experience with database administration across MySQL, MariaDB, PostgreSQL and SQL Server, including replication and failover strategies
Exposure to Elasticsearch/OpenSearch cluster management
Experience managing Redis/ElastiCache clusters for caching and session management
Experience with AI-driven operations tooling (AIOps, ChatOps, intelligent monitoring)
Experience with high-volume e-commerce environments and peak traffic management
Mathematics or Computer Science degree (or equivalent experience)