Site Reliability Engineer (SRE) Cloud Operations
Job description
As a Site Reliability Engineer (SRE) Cloud Operations, you will provide operational ownership, reliability engineering, and cloud operations support for a Digital Banking platform running across Windows Server, IIS, and Microsoft Azure. This role focuses on ensuring availability, performance, security, and scalability for customer-facing digital banking workloads. You will be part of a team delivering 24x7 production support across Windows Server (2016/2019/2022) and Azure (Prod/DR) environments, working closely with Application Development, DevOps, Infrastructure, Network, and Security teams to operate at scale using SRE principles.

Reliability Engineering & Cloud Operations
Provide operational ownership for Digital Banking applications hosted on Windows Server / IIS across Azure and on-prem environments.
Apply SRE principles to improve service reliability, availability, and performance.
Define and execute operational best practices around stability, resiliency, and controlled change.
Support high-availability and disaster-recovery architectures across production and DR environments.

Monitoring, Observability & Incident Response (Core Focus)
Monitor platform and application health using Dynatrace and Splunk.
Perform advanced diagnostics using Windows Event Logs, PerfMon, and Azure Monitor.
Lead and participate in P1/P2 incident response, including bridge calls, real-time troubleshooting, and coordination across multiple teams.
Drive root cause analysis (RCA) and implement preventive and corrective actions.
Track and reduce operational toil through automation and engineering improvements.

Application & Platform Operations
Support application deployments, hotfixes, and production releases with a strong focus on safety and repeatability.
Manage the SSL/TLS certificate lifecycle, including renewals and configuration across IIS and load balancers.
Execute and coordinate OS and application patching using WSUS/SCCM and cloud tooling.
Support and optimize F5 / ADC load-balanced environments.

Security, Compliance & Governance
Enforce security and compliance controls including RBAC, least-privilege access, encryption in transit and at rest, Active Directory, GPOs, service accounts, and secrets management.
Support audits, risk reviews, and control evidence collection.

Automation, CI/CD & Engineering Enablement
Build and maintain automation using PowerShell, DSC, and Ansible.
Partner with DevOps and AppDev teams to support CI/CD pipelines (Azure DevOps, GitHub Actions, Jenkins) for Windows/IIS workloads.
Improve deployment reliability, rollback strategies, and operational guardrails.
Contribute to platform designs supporting blue/green, canary, and zero-downtime deployments where applicable.

Requirements
Strong hands-on experience administering Windows Server (2016/2019/2022) in production environments.
Strong hands-on experience with IIS, including site configuration, application pools, bindings, performance tuning, and troubleshooting.
Hands-on production experience with Dynatrace for application and infrastructure monitoring.
Hands-on production experience with Splunk for log analysis, queries, dashboards, and troubleshooting.
Experience diagnosing system and application issues using Windows Event Logs and PerfMon.
Experience supporting high-severity production incidents, including ownership during incident bridges.
Working knowledge of TCP/IP, HTTP/S, TLS, and integrations with load balancers, WAFs, and reverse proxies.
Experience managing deployments, patching, SSL/TLS certificates, and formal change management processes.
Strong PowerShell scripting and automation experience; exposure to DSC and/or Ansible.
Experience operating workloads in Azure, including production and DR environments.
Working knowledge of Active Directory, GPOs, service accounts, and PKI/certificate management.
Bachelor's degree in Computer Science, Information Technology, or equivalent practical experience.