Lead Site Reliability Engineer

McGraw Hill
1 month ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior
Compensation
$155K

Job location

Remote

Tech stack

Agile Methodologies
Amazon Web Services (AWS)
Application Performance Management
Systems Engineering
Cloud Computing
Code Review
Computer Security
DevOps
GitHub
Identity and Access Management
Uptime
Scrum
Software Engineering
Datadog
Enterprise Software Applications
Performance Monitor
CloudWatch
Terraform
New Relic (SaaS)

Job description

  • Lead a 6-member SRE team supporting production infrastructure and services
  • Manage backlog, sprint planning, and team velocity
  • Own reliability, uptime, security, cost, and performance of services
  • Define and monitor SLOs for application workloads
  • Plan on-call rotations and work to reduce alert fatigue
  • Forecast seasonal growth and plan capacity accordingly
  • Mentor engineers and foster professional growth
  • Report status and issues to leadership monthly
  • Partner with development teams
  • Collaborate with CyberSecurity on risk mitigation
  • Collaborate with FinOps on cost reduction
  • Design and troubleshoot highly distributed, cloud-based production systems
  • Maintain infrastructure-as-code and monitoring-as-code practices
  • Improve system resiliency through failure injection and chaos testing
  • Participate in on-call rotation and resolve operational issues
  • Optimize existing systems for performance and cost
  • Ensure telemetry provides visibility into application performance
  • Support agile development practices and code reviews

Requirements

  • 5+ years of experience in SRE, DevOps, or Software Engineering roles supporting enterprise applications
  • Strong problem-solving, triage, and root cause analysis skills with a systems engineering mindset
  • Deep expertise in the AWS ecosystem, with hands-on experience across core services, primarily ECS, RDS, EKS, IAM, CloudWatch, and networking configurations
  • Expertise with Terraform for managing and automating scalable cloud infrastructure
  • Skill in building CI/CD pipelines (e.g., GitHub Actions) and managing end-to-end software delivery lifecycles
  • Strong familiarity with telemetry and observability tools (e.g., New Relic, Datadog), including querying logs and metrics for performance monitoring

About the company

McGraw Hill, a leading provider of digital educational resources and content, is seeking a Lead Site Reliability Engineer to lead a team of 6 engineers in our Digital Platform Group, supporting our K-12 learning platforms. These platforms serve millions of students and educators nationwide, and you'll play a key role in ensuring their reliability, scalability, and performance. Working closely with engineering and product teams, you'll use your expertise in AWS, Terraform, and observability tools to drive automation, enhance resiliency, and maintain the health of our cloud-based infrastructure.
