Platform Engineer
Job description
We're looking for a Platform Engineer with a strong grounding in site reliability principles to join our Platform team. You'll focus on maintaining and improving production reliability, automating operational tasks, and enhancing our observability stack. You'll work closely with SREs, support engineers, release managers, and incident managers to ensure our systems meet their SLI, SLO, and SLA targets.
Key Responsibilities
Maintain and optimise production environments in AWS (EKS, EC2, RDS/Aurora, S3).
Develop and maintain Infrastructure as Code using Terraform and configuration management with Ansible.
Enhance monitoring, logging, and alerting using the Grafana stack (Prometheus, Loki, Tempo).
Participate in incident management, root cause analysis, and post-incident reviews.
Implement automation to reduce manual operational tasks and improve recovery time.
Contribute to the definition and tracking of SLIs, SLOs, and error budgets.
Collaborate with release and support teams to ensure smooth, reliable rollouts.
Maintain and improve documentation for operational runbooks and platform processes.
Requirements
Solid experience managing Kubernetes clusters (AWS EKS) in production.
Proficient with AWS services relevant to production workloads (EKS, EC2, RDS/Aurora, S3, IAM).
Hands-on experience with Infrastructure as Code using Terraform and configuration management with Ansible.
Strong experience with observability tools (Grafana, Prometheus, Loki, Tempo).
Understanding of SRE concepts (SLIs, SLOs, error budgets, capacity planning).
Comfortable working in incident and problem management processes.
Strong GitOps mindset for managing platform and configuration changes.
Good communication and documentation skills.
Qualifications (Desirable)
Certified Kubernetes Administrator (CKA) and/or Certified Kubernetes Security Specialist (CKS).
AWS Certified Solutions Architect - Associate and/or AWS Certified DevOps Engineer - Professional.
Nice-to-Have
Experience with Python scripting for automation and reliability tooling.
Knowledge of Java and/or React application deployments in production.
Prior experience in high-volume, high-availability environments.