Senior Software Engineer: Site Reliability Engineering
Job description
Our Enterprise Information Technology (EIT) organization is expanding, and we are seeking a Senior Site Reliability Engineer to help drive a major architectural modernization. In this role, you will move beyond traditional infrastructure maintenance to build a proactive, engineering-led ecosystem across our large-scale multi-cloud and co-location footprint.
Working closely with cross-functional IT teams and business units, you will help design and implement standards for our hybrid cloud datacenter model. A primary focus of this position will be the architectural redesign, optimization, and migration of legacy on-premises workloads into Google Cloud Platform (GCP), ensuring everything we build adheres to rigorous SRE principles.
Mission & Impact:
- Everything as Code: Drive repository-led management across our public and private cloud environments to establish consistency and eliminate manual configuration drift.
- Engineering over Toil: Heavily leverage Infrastructure as Code (IaC) to automate repetitive tasks, build smooth "paved roads" for product teams, and develop self-healing systems.
- SLO-Driven Architecture: Help shift our operational focus from traditional component monitoring to user-facing symptoms, defining meaningful SLOs and error budgets.
- Modernization & Migration: Lead the technical execution of re-architecting and redeploying on-premises services into GCP, ensuring scalability, performance, and long-term reliability.
This position may be worked remotely within the United States, with the exception of California.
This position is ineligible for immigration sponsorship and support. Please do not apply if you will need immigration support now or at any time in the future (e.g., H-1B, PERM). All positions, regardless of location, may require an onsite interview or in-person onboarding to verify your identity.
What you'll be responsible for:
- Service Reliability and Performance: Drive the reliability and performance of both public cloud (production, testing, and development) and internal server infrastructure environments.
- SRE Practice Implementation: Design and implement robust Site Reliability Engineering practices, including defining and monitoring Service Level Objectives (SLOs) and Service Level Indicators (SLIs), focusing on proactive system health and error budgets.
- Automation and Toil Reduction: Ruthlessly eliminate manual, repetitive work (toil) through automation. Develop and maintain automation scripts and tooling to streamline operations across the hybrid datacenter model (on-premises and public cloud).
- Implement "Everything as Code": Treat the cloud and on-prem operational environment as a software project by using Infrastructure as Code (IaC) with tools like Terraform, Ansible, and GitHub for provisioning and configuration.
- Configuration and State Management: Design and maintain rigorous configuration management processes to guarantee the consistency and desired state of the hybrid datacenter infrastructure, leveraging tools like Ansible.
- Observability and Alerting: Establish and manage comprehensive monitoring and alerting systems to provide deep visibility into the health and performance of services. Build systems that are self-healing and advocate for themselves.
- Post-Mortems and Root Cause Analysis (RCA): Lead blameless post-mortems and RCAs for critical incidents, focusing on system-level improvements to prevent recurrence and enhance overall reliability.
- Security and Compliance Automation: Develop and implement strategies for efficient patch and vulnerability management across all environments. Automate security remediation efforts to ensure timely vulnerability mitigation and compliance (e.g., CIS, NIST, PCI).
- Cloud Evolution and Migration: Support the company's strategic growth into public cloud services (GCP, Azure) and play a key role in the migration and redesign of services from on-premises data centers to GCP, ensuring adherence to SRE principles throughout the transition.
- Cross-Functional Collaboration: Partner closely with DevOps and development teams to embed reliability best practices throughout the software development lifecycle, ensuring seamless integration and operation of hybrid datacenter services.
- Documentation: Maintain comprehensive and actionable documentation for SRE processes, operational runbooks, and configurations.
- May perform other duties as assigned.
No one will be subject to, and Jack Henry prohibits, any form of discipline, reprisal, intimidation, or retaliation for good faith reports or complaints of discrimination of any kind, pursuing any discrimination claim, or cooperating in related investigations.
Full corporate job descriptions may be requested through the interview process at any time.
Equal Employment Opportunity
Applicants for U.S. based positions with Jack Henry & Associates must be legally authorized to work in the United States. Verification of employment eligibility will be required at the time of hire. Visa sponsorship is not available for this position.
Jack Henry & Associates, Inc. is an Equal Employment Opportunity/Affirmative Action Employer and maintains a Drug-Free Workplace.
Requirements
- Minimum 6 years of experience in cloud and hybrid datacenter operations with a focus on Infrastructure as Code (IaC) and Site Reliability Engineering.
- Proficiency with GCP (preferred), AWS, and/or Azure.
- Proficiency in using GitOps, Terraform, and Ansible in a CI/CD (continuous integration and continuous delivery) pipeline.
- Experience using PowerShell, Python, or Go.
- Solid understanding of Linux (POSIX) and Windows system administration, as well as networking and firewalls.
- Understanding of security best practices and compliance standards such as CIS, NIST, and PCI.
- Ability to participate in an on-call rotation every 7-8 weeks.
What would be nice for you to have:
- Bachelor's degree in Computer Science, Information Technology, or Engineering.
- Relevant industry certifications. Google Associate Cloud Engineer or Google Cloud Architect preferred.
- Proficiency with ArgoCD and GitOps.
- Familiarity with SQL and NoSQL databases.
- Experience with OpenTelemetry and with monitoring and alerting tooling such as Prometheus, Grafana, and the ELK Stack.
- Experience with Site Reliability Engineering (SRE) principles, including but not limited to Service Level Objectives (SLOs) and Service Level Indicators (SLIs), toil reduction, automation, and root cause analysis.
If you've read this far, we hope you're feeling excited about this opportunity. Even if you don't meet every single requirement in this posting, we still encourage you to apply. We're looking for passionate, driven individuals who align with our mission and can bring unique perspectives to our team.
Why Jack Henry?