Senior Site Reliability Engineer
Job description
About the team: This diverse team of engineers assists multiple product teams as we continue to innovate across all of our products within our global AWS cloud landscape.
- Designing, deploying, and maintaining highly available, scalable Kubernetes clusters on AWS EKS, as well as the supporting ecosystem.
- Managing and optimizing cross-portfolio cloud infrastructure, leveraging AWS services and supported organizational tooling.
- Developing and maintaining Infrastructure as Code (IaC) solutions to automate provisioning and management of cloud and Kubernetes resources.
- Writing automation processes to streamline operational workflows, incident response, and infrastructure management.
- Implementing CI/CD pipelines to facilitate deployments, testing, and validation.
- Supporting multi-regional critical infrastructure, ensuring high availability and rapid incident resolution.
- Monitoring system health, instrumenting system components, troubleshooting issues, and performing root cause analysis.
- Managing and supporting a complex cross-portfolio environment, coordinating across teams to ensure consistency and reliability.
- Maintaining comprehensive documentation and best practice guides for solutions, ensuring users have clear instructions and support to effectively implement and operate their systems.
- Mentoring junior team members and promoting best practices in SRE, automation, and cloud architecture.
Requirements
About the role: We are looking to immediately hire a highly skilled and proactive Senior SRE to join our dynamic team. You will combine software thinking and service operations to enable and run Elsevier's large-scale, 24x7, distributed and fault-tolerant systems within agreed reliability objectives, whilst enabling the fast flow of feature and service updates. The successful candidate will possess deep expertise in cloud-native architectures, along with strong automation skills.
- Extensive experience deploying, managing, and troubleshooting containerised applications.
- Deep understanding of Kubernetes architecture, networking, security, storage, and operational best practices.
- Proven expertise with AWS services and architectural principles.
- Extensive knowledge of AWS security, compliance, and best practices.
- Advanced skills in writing modular, reusable IaC components.
- Strong Python scripting skills for automation, tooling, and data processing.
- Ability to develop custom solutions for monitoring, automation, and incident management.
- Experience designing and maintaining CI/CD workflows using GitHub Actions.
- Current experience automating deployment pipelines, testing, and validation processes.
- Familiarity with monitoring tools such as New Relic.
- Knowledge of security best practices, network policies, and enterprise-grade RBAC policies.