Site Reliability Engineer (SRE) Cloud Operations

Infinite Computer Solutions (ICS)
Alpharetta, United States of America

Role details

Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English

Tech stack

Microsoft Windows
Microsoft Active Directory
Amazon Web Services (AWS)
Azure
Banking Software
Cloud Computing
Continuous Integration
DevOps
Disaster Recovery
Event Logging
GitHub
IIS
Windows Server
PowerShell
Reliability Engineering
Site Reliability Engineering Practices
Ansible
TCP/IP
SSL Certificate Management
Transport Layer Security
Load Balancing
PerfMon
Splunk
Dynatrace
Jenkins

Job description

We are seeking a skilled Site Reliability Engineer (SRE) to support and enhance mission-critical Digital Banking platforms running on Azure/AWS, Windows Server, and IIS environments. This role focuses on reliability engineering, cloud operations, production support, observability, and automation across enterprise-scale infrastructure.

  • Provide operational support for Digital Banking applications across Azure/AWS and Windows/IIS environments
  • Monitor and troubleshoot production systems using Dynatrace, Splunk, Windows Event Logs, and PerfMon
  • Lead and support P1/P2 incident response, root cause analysis (RCA), and service restoration
  • Manage IIS configurations, deployments, patching, SSL/TLS certificates, and production releases
  • Support high-availability, disaster recovery (DR), and load-balanced environments
  • Automate operational tasks using PowerShell, DSC, and Ansible
  • Collaborate with DevOps and Engineering teams to support CI/CD pipelines and improve platform reliability
  • Ensure adherence to enterprise security, compliance, and operational standards

Requirements

  • Windows Server (2016/2019/2022)
  • IIS Administration & Troubleshooting
  • Microsoft Azure and/or AWS
  • Dynatrace
  • Splunk
  • PowerShell Automation
  • Experience troubleshooting using Windows Event Logs and PerfMon
  • Strong understanding of TCP/IP, HTTP/S, TLS, Load Balancers, and Web Infrastructure
  • Experience supporting critical production incidents in enterprise environments
  • Knowledge of Active Directory, GPOs, service accounts, and certificate management
  • Experience with CI/CD tools such as Azure DevOps, GitHub Actions, or Jenkins

Preferred

  • Experience in Banking, Financial Services, or other regulated environments
  • Exposure to SRE practices, automation-first operations, and zero-downtime deployments
