Lead Site Reliability Engineer
Job description
You will play a key role in maintaining and evolving FutureLearn's platform to ensure it is highly available, reliable, secure, and scalable as the business grows. Working closely with the Lead Technical Architect, SREs, and software engineers, you'll help shape the technical direction of our infrastructure while fostering a strong DevOps culture that enables teams to deliver high-quality services safely and efficiently.
We're looking for people who are curious, thoughtful, and eager to learn, with a genuine desire to use their experience to support and enable others. You'll need to communicate clearly, work effectively in a collaborative environment, and be comfortable operating autonomously when needed.
What does success look like?
Maintaining platform availability and reliability
- Partner with the Lead Technical Architect to set and evolve the technical direction of our infrastructure, ensuring it scales to support business growth in a cost-effective manner.
- Take responsibility for a platform that is secure, resilient, scalable, and cost-efficient.
- Develop deep expertise in FutureLearn's technology stack and its practical application, including AWS (RDS, ECS, EC2, S3, Lambda), Cloudflare, Redis, DNS, Docker, and the wider infrastructure platform.
- Use, maintain, and continuously improve observability tooling such as Datadog and AWS CloudWatch to monitor platform health, troubleshoot performance issues, and identify root causes.
- Respond to incidents affecting the platform, including participation in the on-call rota.
- Ensure disaster recovery and incident response processes are regularly tested and improved, designing exercises informed by industry best practices such as gamedays and chaos engineering.
- Act as an expert in the tools used to manage infrastructure and CI/CD systems, including Terraform, GitHub Actions, and scripting languages.
Building a DevOps culture at FutureLearn
- Own and continuously improve the developer experience, supporting SREs in refining how the FutureLearn application is developed, tested, and deployed so it is safer, faster, and easier to work on.
- Champion CI/CD best practices, enabling engineers to reliably deliver high-quality services to production.
- Empower software engineers to understand how to get their code into production and how to identify and debug performance issues.
- Support engineers through pairing, teaching, mentoring, coaching, and code reviews, demonstrating the practices of an effective engineer.
- Act as a subject matter expert for infrastructure and operational concerns across FutureLearn.
Requirements
- Hands-on experience with containers and schedulers (Amazon ECS).
- Experience using automated configuration management and infrastructure-as-code tools (Terraform).
- A deep understanding of Linux, networking, and security.
- Experience supporting database administration and performance, with a focus on scalability and maintainability.
- A strong interest in automation and improving the developer experience.
- Experience working closely with software engineers in an agile environment.
- A solid understanding of Git and version control best practices to structure and communicate work effectively.
Preferred (not essential)
- Programming experience in Ruby, JavaScript, or Go.
- Experience managing relationships with external suppliers such as AWS or Cloudflare.