Site Reliability Engineering Lead

Truist Inc
Atlanta, United States of America
4 days ago

Role details

Contract type
Temporary contract
Employment type
Full-time (> 32 hours)
Working hours
Shift work
Languages
English
Experience level
Senior

Job location

Atlanta, United States of America

Tech stack

Agile Methodologies
Software Applications
Application Release Automation
Audit Trail
Automation of Tests
Unit Testing
Behavior-Driven Development
Cloud Engineering
Configuration Management
Static Program Analysis
Computer Security
Database Schema
DevOps
Distributed Systems
Fault Tolerance
Python
Log Analysis
PowerShell
Regression Testing
Reliability Engineering
Ansible
Smoke Testing
Software Configuration Management
Software Engineering
Data Logging
Scripting (Bash/Python/Go/Ruby)
Performance Testing
Test Driven Development
MTTR
Multi-Cloud
Hybrid Cloud
Build Management
Integration Tests
Kubernetes
Deployment Automation
Performance Monitor
Splunk
Software Version Control
Dynatrace
DevSecOps
Go

Job description

The Site Reliability Engineering Lead is a senior, hands-on technical leader within the Wholesale Production Support Operations organization. This teammate is accountable for elevating the reliability, resiliency, and operational excellence of critical enterprise platforms across hybrid cloud and on-prem environments.

Acting as both a hands-on SRE expert and a cross-domain influencer, the SRE Lead drives systemic improvements in observability, automation, AIOps adoption, fault tolerance, and incident management. The role partners closely with Application Development, Infrastructure, Production Support, Platform Delivery, Architecture, Cybersecurity, Risk, and Business technology teams to uplift operational practices and deliver stable, predictable, and scalable services.

This position also plays a pivotal role in building and maturing the SRE Center for Enablement (C4E) by contributing standards, repeatable patterns, runbooks, playbooks, and coaching that amplify reliability practices across the enterprise.

The SRE Lead delivers measurable impact through deep expertise in distributed systems, modern operational tooling, cloud-native reliability patterns, and enterprise-scale incident/problem management.

The following is a summary of the essential functions for this job. Other duties may be performed, both major and minor, which are not mentioned below. Specific activities may change from time to time.

1. Guide, educate, and provide thought leadership to our delivery teams as related to their optimum adoption of DevSecOps practices and framework.

  1. Champion the use of DevSecOps as a strategic asset of culture change to enhance the flow of business value to our clients.

  2. Make informed decisions about which tool best fits a given situation, drawing on proficiency with multiple vendor products across each of the above capabilities.

  3. Develop and recommend DevSecOps best practices.

  4. Use sophisticated, analytical thought to exercise judgment and design innovative solutions for the most complex components of the DevSecOps lifecycle.

  5. Work independently, with guidance in only the most complex situations.

  6. Provide technical and process guidance to junior team members.

  7. Build and maintain the automation and streamlining of software delivery and operations for new or existing software applications through advanced proficiency and subject matter expertise in vendor tools in the DevOps lifecycle including:

     a. Infrastructure as Code; Agile and Development Lifecycle Management; Source Code Management; Build Orchestration; Build Management; Artifact Repository Management; Behavior Driven Development; Test Driven Development; Automated Testing including Unit Testing, Integration Testing, Functional Testing, Smoke Testing, Regression Testing, Stress Testing, and Performance Testing; Static Code Analysis; Load and Performance Testing; Artifact Scanning; Database Schema Management, Orchestration and Recovery; Compliance Automation and Audit Trails; Configuration Management; Containers; Application Release Automation; Deployment Strategies and Patterns including Blue/Green Deployment, Canary Releases, and Rolling Releases; Logging and Log Analytics; and Performance Monitoring and Management.

  8. Liaise with the DevSecOps Center for Enablement (C4E) to ensure that enterprise tools and practices are followed, and to share information about any team-specific tools or practices that may benefit other teams.

Incident & Problem Management Leadership

  • Lead major and high-severity incident response efforts, focusing on diagnosing technical root causes and driving multi-team technical resolution.

  • Drive problem management to closure, ensuring systemic fixes replace recurring operational risks.

  • Establish and maintain standardized incident playbooks, escalation paths, and communication frameworks.

Reliability Engineering & Automation

  • Architect and deliver automation solutions that eliminate toil, reduce MTTR, and increase service resilience.

  • Implement intelligent alerting, anomaly detection, and event correlation leveraging AI and AIOps tools.

  • Guide and enforce SLO/SLI adoption across product teams, ensuring metrics inform decision-making and prioritization.

Observability & Operational Excellence

  • Enhance telemetry coverage across logs, metrics, traces, and events using platforms such as Dynatrace and Splunk.

  • Define and standardize enterprise observability practices, dashboards, and KPIs.

  • Ensure operational readiness of applications and platforms through resiliency testing, chaos engineering, and failure-mode validation.

Cross-Functional Leadership & Influence

  • Partner with Delivery, Architecture, Security, and Risk teams to embed reliability and resilience into design and execution.

  • Act as a change agent to elevate operational maturity and drive transformative improvements across Wholesale.

  • Lead workshops, maturity assessments, and enablement sessions through the SRE C4E and Communities of Practice.

Standardization & Documentation

  • Develop, maintain, and enforce runbooks, response playbooks, and automated recovery patterns.

  • Contribute to enterprise SRE frameworks, templates, and maturity models.

  • Promote consistent adoption of best practices across domains and lines of business.

Mentorship & Technical Development

  • Coach and mentor Associate, Professional, and Senior SREs to build technical depth and operational discipline.

  • Provide thought leadership in SRE methodologies, cloud-native operational patterns, and automated reliability engineering.

Requirements

The requirements listed below are representative of the knowledge, skill and/or ability required. Reasonable accommodations may be made to enable individuals with disabilities to perform the essential functions.

  1. Bachelor's degree or equivalent education and related training or experience

  2. Seven+ years of experience in software engineering or IT, including at least four years in a role whose primary responsibility is DevOps engineering or the development, maintenance, and support of CI/CD pipelines.

  3. Must demonstrate ability to write code

  4. Foundational cloud architecture knowledge

  5. Must demonstrate ability to construct a basic application build pipeline

  • 7+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, or Infrastructure Operations.

  • Deep hands-on experience with distributed systems, container orchestration (Kubernetes), and cloud-native operational tooling.

  • Proficiency with automation and scripting languages (Python, Go, PowerShell, Ansible).

  • Strong understanding of observability platforms (Splunk, Dynatrace) and event-driven monitoring.

  • Proven leadership in major incident management and cross-team technical coordination.

  • Strong grasp of networking, Linux/Unix internals, and modern infrastructure patterns.

  • Excellent communication skills, including executive-level situational awareness during critical incidents.

  • Demonstrated ability to influence technical roadmaps and drive adoption of reliability best practices.

Preferred Qualifications

  • Financial services or regulated industry experience.

  • Experience enabling large-scale SRE transformations or modernization initiatives.

  • Familiarity with chaos engineering, resilience assessments, and service failure modeling.

  • Exposure to hybrid-cloud and multi-cloud operational frameworks.

  • Experience contributing to or leading Center for Enablement functions or Communities of Practice.

Able to access and interpret client information received from the computer and able to hear and speak with individuals in person and on the phone.

Manual Dexterity / Keyboarding

Able to work standard office equipment, including PC keyboard and mouse, copy/fax machines, and printers.

Availability

Able to work all hours scheduled, including overtime as directed by manager/supervisor and required by business need.

Travel

Minimal and up to 10%
