Lead Site Reliability Engineer (GCP & Kubernetes)

Htc Inc.
Celebration, United States of America
24 days ago

Role details

Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Experience level: Senior

Job location

Celebration, United States of America

Tech stack

Amazon Web Services (AWS)
Azure
Bash
Cloud Computing
Cloud Engineering
DevOps
Disaster Recovery
Distributed Systems
Python
Reliability Engineering
Prometheus
Google Cloud Platform
Grafana
Kubernetes Helm Charts
Multi-Cloud
Kubernetes
Cloud Migration
Terraform
Splunk

Job description

We are seeking a Lead Site Reliability Engineer to drive reliability, scalability, and operational excellence across a rapidly growing technology ecosystem. This role serves as a technical leader focused on cloud architecture, Kubernetes platforms, infrastructure automation, and highly available distributed systems. The position plays a key role in defining infrastructure strategy, improving platform resiliency, and mentoring engineering teams.

  • Design and support highly available cloud infrastructure in GCP
  • Architect and manage Kubernetes environments at scale
  • Build and maintain Infrastructure-as-Code using Terraform
  • Develop and manage Helm charts and Kubernetes deployments
  • Design failover, disaster recovery, and multi-region strategies
  • Improve platform scalability, reliability, and performance
  • Implement monitoring, alerting, and observability best practices
  • Partner with engineering teams on platform architecture and cloud adoption
  • Mentor engineers and provide technical leadership

Requirements

  • 7+ years of experience in Site Reliability Engineering, Platform Engineering, Cloud Engineering, or DevOps
  • Expert-level Kubernetes experience
  • Strong Google Cloud Platform (GCP) experience
  • Expertise with Terraform
  • Experience with Helm
  • Multi-cloud exposure, including AWS and Azure
  • Experience with distributed systems
  • Python or Bash scripting experience
  • Experience with Prometheus, Grafana, Splunk, or OpenTelemetry
  • SRE, DevOps, Infrastructure, Platform, or Cloud Operations: 5 years (Required)
  • Expert-level Kubernetes, managing deployments at scale: 3 years (Required)

Benefits & conditions

  • 401(k)
  • 401(k) matching
  • Dental insurance
  • Employee assistance program
  • Employee discount
  • Health insurance
  • Health savings account
  • Life insurance
  • Paid time off
  • Relocation assistance
  • Vision insurance
