SRE - Site Reliability Engineer
Job description
Operate and enhance our Kubernetes platform across AWS, Azure, and on-premises environments. Lead incident response, problem management, and root cause analysis. Deliver cluster lifecycle work: upgrades, patching, node pools, CNI/CSI, ingress, and Rancher operations. Own observability, including dashboards, alerting, and SLOs/SLIs. Implement GitOps (Fleet) and reduce toil through automation and strong governance. Apply secure API gateway and WAF patterns. Work with distributed-systems patterns, including event brokers and asynchronous messaging. Maintain security posture through CVE remediation, GRC controls, and scanning pipelines.
Requirements
Deep knowledge of Kubernetes, Rancher, GitOps, Linux, and cloud networking. Understanding of API gateway and WAF patterns. Experience with distributed systems and event-driven architectures. Strong automation and scripting skills (Python, Go, Bash, PowerShell, .NET).
IaC:
- Terraform for foundational/bootstrap cluster provisioning.
- Crossplane as an orchestration layer (leveraging Terraform providers).
Ability to work securely within PCI DSS and GDPR requirements.
CI/CD: Concourse, GitHub Actions, Azure DevOps.
Observability: Grafana, Prometheus, Jaeger/Tempo, CloudWatch, Loki, OpenTelemetry.
Nice to have
AWS operational experience. Service mesh (Istio/Kuma). Hybrid cloud experience (AWS + Azure + on-premises). Payments or regulated-industry background.