Site Reliability Engineer
Job description
Platform Reliability & Cloud Engineering
- Ensure high availability, performance, and security of production systems across Windows, Linux, and Google Cloud Platform environments.
- Engineer and support containerized workloads using Kubernetes (GKE) and Docker, enabling scalable microservices architectures.
- Lead infrastructure provisioning and configuration using Terraform, Ansible, and Google Cloud Platform-native tools.
Automation & Observability
- Develop automation scripts and pipelines to eliminate manual toil and accelerate incident response.
- Implement observability frameworks using SLIs/SLOs, Prometheus, Grafana, and Google Cloud Operations Suite.
- Drive proactive monitoring, alerting, and telemetry across hybrid environments.
Requirements
- Strong scripting skills in PowerShell, Python, or Shell.
- Hands-on experience with Google Cloud Platform services, including GKE, IAM, Cloud Functions, and Cloud Monitoring.
- Proficiency in container technologies: Docker and Kubernetes.
- Familiarity with Linux system administration and hybrid cloud environments.
- Experience with infrastructure-as-code tools: Terraform, Ansible.
- Strong understanding of Active Directory, DNS, DHCP, and Windows security principles.
- Security certifications (e.g., CISSP, Security+, Google Cloud Professional Cloud Security Engineer).
- Experience with CI/CD tools (e.g., GitLab CI and Jenkins).
- Familiarity with ITIL practices and change management.
- Exposure to ServiceNow, load balancers, certificate management, and endpoint protection tools.
- Experience in Financial Services or another highly regulated industry.
- Ability to be on call over weekends, and possibly holidays, as needed.