Site Reliability Engineer - AWS & Azure
Job description
We are seeking a highly skilled Site Reliability Engineer (SRE) with expertise in both Azure and AWS cloud platforms. The role takes a lead in migrating an existing on-prem HPC solution to the cloud and in enhancing the reliability, scalability, and performance of that cloud infrastructure through automation, software engineering practices, and proactive system management. The ideal candidate will bridge the gap between development and operations, applying a software engineering mindset to IT operations and infrastructure.
- Work with existing solutions already in place in the US to redefine, implement, and maintain scalable, reliable cloud infrastructure across Azure and AWS for the UK business, a similar but separate entity.
- Develop automation scripts and tools to streamline operational tasks such as log analysis, environment testing, and incident response.
- Collaborate with development and operations teams to ensure seamless deployment and performance of applications and services.
- Monitor system performance and availability, proactively identifying and resolving issues.
- Apply software engineering principles to infrastructure management, improving efficiency and reducing manual effort.
- Deliver value by monitoring spending, optimizing resource usage through right-sizing and automation, and implementing governance through tagging strategies and budget alerts.
- Document the solution and deliver knowledge transfer and training to existing team members.
Requirements
- Strong understanding of cloud-native architectures and services in Azure and AWS, including AKS/EKS and their automation.
- Experience with infrastructure-as-code tools (e.g., Terraform).
- Familiarity with CI/CD pipelines, containerization (Docker, Kubernetes), and monitoring tools.
- Knowledge of data processing and configuration design.
- Experience with IT infrastructure and monitoring systems.
- Bachelor's degree in Computer Science, Computer Engineering, Information Technology, or a related field.
- Extensive experience in site reliability engineering, DevOps, or cloud infrastructure roles.