Principal Site Reliability Engineer
Job description
We are seeking an experienced Sr. Principal Engineer, Site Reliability (SRE) to drive technical excellence within our global Site Reliability Engineering organization. This role is essential to maintaining and improving the reliability, scalability, and performance of our multi-cloud SaaS platform serving thousands of customers worldwide. The successful candidate will provide hands-on technical expertise and strategic technical direction in incident response, system optimization, and reliability engineering practices across our complex technology stack. The role includes off-hours support as needed.
Technical Leadership
- Provide strategic technical direction for a team of 5+ SRE engineers across one or more geographic regions (US, Ireland, or India)
- Provide technical mentorship and guidance for team members
- Drive technical decision-making for complex reliability and performance challenges
- Conduct architecture reviews and drive system design decisions for reliability
- Lead post-incident reviews and drive implementation of preventive measures
Incident Management & Response
- Participate in enterprise-wide incident management, ensuring rapid prevention, detection, response, and resolution
- Develop and maintain runbooks and emergency response procedures
- Lead root cause analysis and ensure comprehensive documentation
- Participate in 24/7 on-call rotation and escalation procedures across global teams
- Interface with Engineering teams and the Incident Manager during critical incident resolution
Platform Reliability & Performance
- Monitor and optimize multi-cloud infrastructure (AWS primary, Azure, Google Cloud Platform)
- Ensure reliability of core services: AWS resources, Auth0/Okta authentication, databases (SQL Server, PostgreSQL, MongoDB), and legacy Java applications
- Implement and maintain SLIs, SLOs, and error budgets for assigned services
- Drive capacity planning and performance optimization initiatives
Automation & Tooling
- Design automation solutions to reduce manual operational overhead
- Develop monitoring strategies using New Relic, Grafana, and Sumo Logic
Requirements
Technical Experience
- 8+ years in SRE, DevOps, or Infrastructure Engineering roles, with 4+ years in senior positions
- Deep hands-on experience with multi-cloud environments (AWS required, Azure preferred)
- Strong Linux system administration and troubleshooting
- Experience with containerization (Docker) and orchestration (Kubernetes, ECS)
- Proficiency with monitoring tools (New Relic, Grafana, Prometheus)
Leadership & Communication
- Proven track record mentoring technical teams and driving technical direction
- Experience serving as senior technical leader during critical incidents
- Strong communication skills with engineering teams and stakeholders
- Cross-functional collaboration in agile environments
SRE & Operations
- Demonstrated success implementing SRE principles in large-scale production environments
- Experience with ITIL frameworks and tools
- Background in establishing and maintaining SLAs for enterprise SaaS products
Preferred
- Authentication and identity management systems knowledge
- Infrastructure-as-code tools (Terraform, CloudFormation)