Principal Site Reliability Engineer (SRE)
Job description
- Design and implement highly available, scalable infrastructure systems that support mission-critical production services, including automated deployment pipelines, observability platforms, and disaster recovery
- Lead incident response and postmortem processes, diving deep into complex distributed systems failures to identify root causes and drive systemic reliability improvements across engineering teams
- Develop and maintain service level objectives (SLOs) and error budgets, using data-driven approaches to balance feature velocity with system reliability and guide organizational decision-making
- Build tooling and automation to eliminate toil, improve operational efficiency, and enable engineering teams to safely deploy and operate services with minimal manual intervention
Requirements
- 7+ years of relevant experience
- Bachelor's degree in relevant field(s) of study or equivalent
Preferred Qualifications
- 5+ years of experience in site reliability engineering, systems engineering, or DevOps roles with a proven track record of maintaining large-scale production systems
- Deep expertise in AWS, including infrastructure-as-code tools such as Terraform, CloudFormation, or Pulumi
- Experience defining and measuring SLIs, SLOs, and error budgets, and using them to drive reliability improvements and inform product decisions
- Proficiency in AI development
- Strong programming skills in languages such as Python, Go, or Node.js, with the ability to write production-quality code for automation, tooling, and system integration
- Extensive experience with container orchestration (ECS or similar) and microservices architectures in production environments
- Proficiency with observability and monitoring tools such as Dynatrace, Prometheus, Grafana, Datadog, or New Relic, and experience building comprehensive monitoring and alerting systems
- Solid understanding of networking concepts, load balancing, CDNs, DNS, and distributed systems principles, including consensus algorithms and failure modes
- Hands-on experience with CI/CD pipelines and GitOps workflows using tools like Jenkins, GitHub Actions, ArgoCD, or CircleCI
- Strong incident management and troubleshooting skills, with the ability to quickly diagnose and resolve complex production issues under pressure
- Excellent communication and collaboration skills, with the ability to influence technical direction across multiple teams and mentor engineers at various levels
Benefits & conditions
Ally's compensation program offers market-competitive base pay and pay-for-performance incentives (bonuses) based on achieving personal and company goals. But Ally's total compensation - or total rewards - extends beyond your paycheck and is designed to support and enrich your personal and professional life, including:
- Time Away: competitive holiday and flexible paid time off, including time off for volunteering and voting.
- Planning for the Future: plan for the near and long term with an industry-leading 401(k) retirement savings plan with matching and company contributions, student loan and 529 educational assistance programs, tuition reimbursement, and other financial well-being programs.
- Supporting your Health & Well-being: flexible health and insurance options including dental and vision, a pre-tax Health Savings Account with employer contributions, and a total well-being program that helps you and your family stay on track physically, socially, emotionally, and financially.
- Building a Family: adoption, surrogacy, and fertility support, as well as parental and caregiver leave, a back-up child and adult/elder day care program, and childcare discounts.
- Work-Life Integration: other benefits including the LifeMatters® Employee Assistance Program, a subsidized and discounted Weight Watchers® program, and other employee discount programs.