Senior DevOps Engineer
Job description
As a Senior DevOps Engineer within the Cloud team, you will be responsible for building and maintaining the infrastructure that powers Fast Track at scale. Today, we operate hundreds of brands across dozens of Kubernetes clusters on AWS, and our ambition is to turn infrastructure into a competitive advantage, not a bottleneck.
To be successful in this role, you will need strong hands-on experience with AWS, Kubernetes, and Infrastructure as Code, and you should enjoy working on automation that makes a real difference to how teams ship and operate software.
You won't just maintain what exists. You'll help drive the next phase of our cloud operations: improving the current GitOps workflow, automating the environment lifecycle, enhancing our CI/CD pipelines, and making our infrastructure more observable, cost-efficient, and resilient.
Our direction is clear:
- Remove any manual provisioning steps from the environment lifecycle
- Consolidate and standardise IaC modules across all clusters and accounts
- Improve observability, reliability, and disaster recovery capabilities
- Reduce cloud spend through right-sizing, tiered capacity, and cost visibility
- Keep our stack simple and up to date
This role is key to scaling our infrastructure operations and supporting Fast Track's continued growth.
Responsibilities
- Own and evolve Terraform modules, keeping them up to date with the latest provider features and team standards
- Manage EKS cluster upgrades across standalone and shared accounts, including add-on validation and rollout coordination
- Automate environment provisioning and teardown, including resource cleanup for decommissioned brands
- Analyse and improve CI/CD workflows in collaboration with development teams, ensuring rapid and reliable delivery
- Ensure deployment consistency between staging, production, and standalone environments
Performance, Reliability and Cost
- Design and reconfigure Kubernetes deployments for optimal performance, auto-scaling, and cost efficiency
- Ensure load balancers and client-facing integration points are highly performant and highly available
- Work with SRE to develop and improve monitoring, alerting, and proactive indicators using tools like Grafana and HyperDX
- Prepare and automate failover and disaster recovery runbooks
- Contribute to cost optimisation by right-sizing resources and implementing tier-based capacity models
Collaboration and Growth
- Continuously evaluate and adopt emerging AI trends and capabilities to raise the bar on infrastructure quality, speed, and operational maturity
- Collaborate closely with Integrations, SRE, and development teams to simplify and automate operational processes
- Keep documentation for infrastructure and operational processes accurate and up to date
- Mentor junior team members and advocate for DevOps best practices across the organisation
Requirements
- Proven experience implementing DevOps practices at scale in complex production environments.
- Proven ability to leverage AI-assisted tooling to accelerate engineering workflows, combined with the discipline to validate, test, and critically review AI-generated outputs before they reach production.
- AWS Proficiency: Deep understanding of AWS services, including EKS, EC2, RDS, S3, IAM, VPC, Route53, and CloudWatch. Experience managing multi-account AWS environments.
- Kubernetes Proficiency: Extensive hands-on experience with Kubernetes in production, including cluster upgrades, Helm chart management, HPA/VPA, cluster autoscaler, EKS add-ons, and troubleshooting.
- Infrastructure as Code: Strong command of Terraform with experience writing and maintaining reusable modules across multiple environments.
- CI/CD Expertise: Proficiency in CI/CD practices and pipeline design. Experience with GitHub Actions is highly advantageous.
- Containerisation & Linux: Solid experience with Docker, container best practices, and Linux environments. Proficiency in scripting for automation.
- Monitoring & Observability: Proficiency with monitoring and logging tools such as Grafana, HyperDX, and PagerDuty.
- Ownership & Collaboration: A self-driven engineer who takes responsibility for outcomes, advocates for DevOps as a culture shift, and brings energy to a collaborative team environment.
- A passionate learner who actively seeks out the latest industry trends, best practices, and technologies. Willingness to lead internal workshops and training sessions.
- Technical leadership within projects, influencing best practices and driving execution across teams.
- Ability to independently navigate complex challenges and contribute strategically to the company's long-term goals.
- Strong communication skills in English, with the ability to convey complex ideas in simple terms.
Nice to Have
- Cloudflare: Experience with Cloudflare for DNS management, CDN, and edge configuration.
- Familiarity with our application stack: Golang, Aurora MySQL, ClickHouse, Kafka, or RabbitMQ. You don't need to develop in these, but understanding how they behave in production helps with infrastructure decisions.
- API awareness: Comfort working with REST APIs for automation and tooling integration (AWS APIs, PagerDuty, Cloudflare, internal tooling).