Site Reliability Engineer (SRE), Cloud Operations
Job description
We are seeking a skilled Site Reliability Engineer (SRE) to support and enhance mission-critical Digital Banking platforms running on Azure/AWS, Windows Server, and IIS environments. This role focuses on reliability engineering, cloud operations, production support, observability, and automation across enterprise-scale infrastructure.
- Provide operational support for Digital Banking applications across Azure/AWS and Windows/IIS environments
- Monitor and troubleshoot production systems using Dynatrace, Splunk, Windows Event Logs, and PerfMon
- Lead and support P1/P2 incident response, root cause analysis (RCA), and service restoration
- Manage IIS configurations, deployments, patching, SSL/TLS certificates, and production releases
- Support high-availability, disaster recovery (DR), and load-balanced environments
- Automate operational tasks using PowerShell, DSC, and Ansible
- Collaborate with DevOps and Engineering teams to support CI/CD pipelines and improve platform reliability
- Ensure adherence to enterprise security, compliance, and operational standards
Requirements
- Windows Server (2016/2019/2022)
- IIS Administration & Troubleshooting
- Microsoft Azure and/or AWS
- Dynatrace
- Splunk
- PowerShell Automation
- Experience troubleshooting with Windows Event Logs and PerfMon
- Strong understanding of TCP/IP, HTTP/S, TLS, load balancers, and web infrastructure
- Experience supporting critical production incidents in enterprise environments
- Knowledge of Active Directory, GPOs, service accounts, and certificate management
- Experience with CI/CD tools such as Azure DevOps, GitHub Actions, or Jenkins
Preferred
- Experience in Banking, Financial Services, or other regulated environments
- Exposure to SRE practices, automation-first operations, and zero-downtime deployments