Lead, Site Reliability Engineer
Job description
The Lead, Site Reliability Engineer (SRE) provides technical and strategic leadership for Royal Caribbean Group's DevOps and platform engineering ecosystem. This role defines standards, guides platform architecture, and drives enterprise-wide initiatives across CI/CD, Kubernetes, GitOps, observability, security, and AI-enabled automation to support reliable, scalable software delivery. The engineer will lead platform design and evolution, drive intelligent automation, and ensure robust integration of DevOps tooling with business processes, fostering operational excellence and innovation.
- Owns SRE and DevOps strategy across AWS and Azure, architecting cloud patterns for high availability, disaster recovery, and cost optimization.
- Leads Kubernetes/Helm platform design and evolution (EKS, AKS) supporting production workloads.
- Drives AI-assisted SRE capabilities by identifying opportunities for intelligent automation, remediation, and operational insights across CI/CD and platform operations.
- Owns the GitHub Actions platform, designing reusable workflows and enforcing fully automated end-to-end pipelines.
- Mandates Snyk and SonarQube in all pipelines, enforcing security gates, quality thresholds, and exemption workflows.
- Integrates Terraform IaC execution directly within CI/CD, ensuring infrastructure changes flow through automated controls.
- Owns Backstage lifecycle, including catalog, scaffolder templates, plugin integrations, and adoption governance.
- Builds Software Templates that pre-wire CI/CD, Terraform modules, and security tooling for new services from day one.
- Owns pipeline-to-ServiceNow integration, automating change/release records and gating deployments against approved change windows.
- Leads, mentors, and grows a team of SRE and DevOps engineers, owning technical escalation and platform SLAs/SLOs.
- Drives engineering culture through blameless post-mortems, runbooks, documentation, and operational excellence.
Requirements
- Bachelor's degree in Computer Science, Engineering, or related field required; Master's degree preferred.
- 7+ years in SRE/DevOps/Platform Engineering, with at least 2 years in a technical lead or staff-level role.
- Deep expertise in AWS (EKS, EC2, IAM, Lambda, CloudWatch) and Azure (AKS, Entra ID, Azure Monitor).
- Expert in Terraform (modules, remote state, pipeline-automated execution, GitOps workflows).
- Advanced proficiency with GitHub Actions (multi-job workflows, reusable actions, OIDC, secrets management).
- Production Kubernetes experience (cluster lifecycle, Helm authoring, RBAC, network policies).
- Hands-on experience with Backstage (catalog config, scaffolder templates, plugin integration, governance).
- Demonstrated Snyk and SonarQube pipeline integration with enforced security and quality gates.
- Experience integrating DevOps tooling with ServiceNow change, release, or Digital Release.
- Proven track record reducing deployment lead time, MTTR, or improving platform reliability.
- Hospitality, travel, or high-volume consumer tech experience.
- AWS Solutions Architect Professional; CKA/CKAD certifications a strong plus.
- Experience with GitOps tooling (ArgoCD, Flux) and progressive delivery (canary, blue/green).
- Backstage plugin development (TypeScript/React).
- PCI-DSS, SOC 2, or travel industry compliance background.
- Source Control: Git / GitHub
- CI/CD: GitHub Actions
- IaC: Terraform, Ansible
- Containers: Kubernetes / Helm (EKS, AKS)
- Cloud: AWS and Azure
- Dev Portal: Backstage
- Security: Snyk, SonarQube
- Release: ServiceNow Digital Release
- Effective mentor and collaborator, able to build capability and drive adoption.
- Strong interpersonal skills to communicate with all levels of management.
- Ability to work independently and as part of a cross-functional team.