Principal Site Reliability Engineer

iCIMS

1 month ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Tech stack

Java

Amazon Web Services (AWS)

User Authentication

Azure

Software as a Service

Cloud Computing

Program Optimization

Databases

DevOps

PostgreSQL

Linux System Administration

Microsoft SQL Server

MongoDB

Performance Tuning

Reliability Engineering

Prometheus

Google Cloud Platform

Okta

Grafana

CloudFormation

Containerization

Kubernetes

Sumo Logic

Terraform

New Relic (SaaS)

Docker

Job description

We are seeking an experienced Sr. Principal Engineer, Site Reliability (SRE) to drive technical excellence within our global Site Reliability Engineering organization. This role is essential to maintaining and improving the reliability, scalability, and performance of our multi-cloud SaaS platform serving thousands of customers worldwide. The successful candidate will provide hands-on technical expertise and strategic technical direction in incident response, system optimization, and reliability engineering practices across our complex technology stack. Off-hours support is expected as needed.

Technical Leadership

Provide strategic technical direction for a team of 5+ SRE engineers across one or more geographic regions (US, Ireland, or India)
Provide technical mentorship and guidance for team members
Drive technical decision-making for complex reliability and performance challenges
Conduct architecture reviews and drive system design decisions for reliability
Lead post-incident reviews and drive implementation of preventive measures

Incident Management & Response

Participate in enterprise-wide incident management, ensuring rapid prevention, detection, response, and resolution
Develop and maintain runbooks and emergency response procedures
Lead root cause analysis and ensure comprehensive documentation
Participate in 24/7 on-call rotation and escalation procedures across global teams
Interface with Engineering teams and the Incident Manager during critical incident resolution

Platform Reliability & Performance

Monitor and optimize multi-cloud infrastructure (AWS primary, Azure, Google Cloud Platform)
Ensure reliability of core services: AWS resources, Auth0/Okta authentication, databases (SQL Server, PostgreSQL, MongoDB), and legacy Java applications
Implement and maintain SLIs, SLOs, and error budgets for assigned services
Drive capacity planning and performance optimization initiatives

Automation & Tooling

Design automation solutions to reduce manual operational overhead
Develop monitoring strategies using New Relic, Grafana, and Sumo Logic

Requirements

Technical Experience

8+ years in SRE, DevOps, or Infrastructure Engineering roles, with 4+ years in senior positions
Deep hands-on experience with multi-cloud environments (AWS required, Azure preferred)
Strong Linux system administration and troubleshooting
Experience with containerization (Docker) and orchestration (Kubernetes, ECS)
Proficiency with monitoring tools (New Relic, Grafana, Prometheus)

Leadership & Communication

Proven track record mentoring technical teams and driving technical direction
Experience serving as senior technical leader during critical incidents
Strong communication skills with engineering teams and stakeholders
Cross-functional collaboration in agile environments

SRE & Operations

Demonstrated success implementing SRE principles in large-scale production environments
Experience with ITIL frameworks and tools
Background in establishing and maintaining SLAs for enterprise SaaS products

Preferred

Authentication and identity management systems knowledge
Infrastructure-as-code tools (Terraform, CloudFormation)

About the company

When you join iCIMS, you join the team helping global companies transform business and the world through the power of talent. Our customers do amazing things: design rocket ships, create vaccines, deliver consumer goods globally, overnight, with a smile. As the Talent Cloud company, we empower these organizations to attract, engage, hire, and advance the right talent. We're passionate about helping companies build a diverse, winning workforce and about building our home team. We're dedicated to fostering an inclusive, purpose-driven, and innovative work environment where everyone belongs.