Site Reliability Engineer - Ctj - Poly

Microsoft
Redmond, United States of America
yesterday

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior
Compensation
$235K

Job location

Reston, United States of America

Tech stack

Audit Trail
Azure
Microsoft Online Services
Linux
Distributed Systems
GitHub
Red Hat Enterprise Linux - RHEL
Reliability Engineering
Site Reliability Engineering Practices
Ansible
Software Engineering
Data Processing
Information Technology

Job description

  • Owns reliability architecture and end-to-end service understanding (dependencies, failure modes, and customer journeys) for distributed systems at scale. Defines and improves service health via SLIs/SLOs, error budgets, and well-defined operational readiness criteria. Drives cross-team reliability reviews and recommends design changes, runbooks, and safe rollout/rollback strategies that improve availability, latency, performance, and efficiency while managing cost.

  • Maintains deep, current expertise in cloud reliability practices and the evolving technology landscape. Drives adoption of new platform capabilities and operational patterns (e.g., progressive delivery, resilience testing, chaos engineering where appropriate). Mentors engineers through design reviews, incident walkthroughs, and knowledge sharing to raise the reliability bar across related services.

  • Implements reliable, scalable, and high-performance changes using SRE practices (progressive delivery, feature flags where applicable, safe rollouts/rollbacks). Owns implementation and rollback plans, validates operational readiness, and reduces toil through automation, self-healing, and standardized playbooks.

  • Leverages telemetry and production signals to identify reliability risks and recurring failure patterns, then ships configuration changes, code fixes, or automation to address root causes. Expands infrastructure-as-code and operational tooling so teams can manage platforms and services safely and repeatably through code and policy.

  • Builds and improves observability (metrics, logs, traces, dashboards, alerts) and uses it to detect, diagnose, and prevent incidents. Defines actionable alerting, reduces noise, and ensures instrumentation supports SLO reporting and rapid troubleshooting. Develops automation to validate telemetry pipelines and to enable automated mitigation and safer incident response.

  • Participates in on-call rotations and leads response for complex, high-impact incidents by establishing incident command, assessing impact, coordinating responders, and driving mitigations to restore service within SLOs. Produces and contributes to blameless postmortems with corrective and preventative actions (CPAs), tracks them to completion, and implements automation and guardrails to prevent recurrence.

  • Applies secure-by-design and compliance requirements to operations, monitoring, and automation (least privilege, auditability, change control, and data handling). Partners with security, privacy, and compliance teams to identify gaps, prioritize fixes, and implement automated controls and detection to prevent repeated violations.

  • Embodies our culture (https://careers.microsoft.com/v2/global/en/culture) and values (https://www.microsoft.com/en-us/about/corporate-values).

Requirements

  • Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience.

Other Requirements:

Security Clearance Requirements: Candidates must be able to meet Microsoft, customer, and/or government security screening requirements for this role. These requirements include, but are not limited to, the following specialized security screenings:

  • The successful candidate must have an active U.S. Government Top Secret Clearance with access to Sensitive Compartmented Information (SCI) based on a Single Scope Background Investigation (SSBI) with Polygraph. The ability to meet Microsoft, customer, and/or government security screening requirements is required pre-offer and post-hire for this role. Failure to maintain or obtain the appropriate U.S. Government clearance and/or customer screening requirements may result in employment action up to and including termination.

  • Clearance Verification: This position requires successful verification of the stated security clearance to meet federal government customer requirements. You will be asked to provide clearance verification information prior to an offer of employment.

  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

  • Citizenship & Citizenship Verification: This position requires verification of U.S. citizenship due to citizenship-based legal restrictions. Specifically, this position supports U.S. federal, state, and/or local government agency customers and is subject to certain citizenship-based restrictions where required or permitted by applicable law. To meet this legal requirement, citizenship will be verified via a valid passport, other approved documents, or a verified U.S. government clearance.

Preferred Qualifications:

  • Bachelor's Degree in Computer Science, Information Technology, or related field AND 8+ years technical experience in software engineering, network engineering, service engineering, or systems engineering OR equivalent experience.

  • 3+ years technical experience working with large-scale cloud or distributed systems.

  • Experience building automation with Ansible and developing/operating CI/CD pipelines (e.g., Azure DevOps, GitHub Actions) to deliver reliable, repeatable deployments.

  • Expertise in problem solving and analyzing distributed systems and critical production service environments.

  • Expertise in Linux (specifically Rocky Linux 9, Red Hat, Mariner, or similar), including throughput management, troubleshooting, and security hardening.

Site Reliability Engineering IC4 - The typical base pay range for this role across the U.S. is USD $119,800 - $234,700 per year. A different range applies to specific work locations within the San Francisco Bay Area and the New York City metropolitan area, where the base pay range for this role is USD $158,400 - $258,000 per year.

About the company

Microsoft is a global technology company headquartered in Redmond, Washington. Our mission is to empower every person and every organization on the planet to achieve more. We develop, license, and support a wide range of software products, services, and devices that help individuals and businesses realize their full potential.

Our flagship products include the Microsoft 365 productivity cloud, Windows operating system, Azure cloud platform, and Dynamics 365 business applications. We are also a leader in areas such as artificial intelligence, cybersecurity, developer tools, and gaming through Xbox and Game Pass.

With operations in more than 190 countries and over 220,000 employees worldwide, Microsoft is committed to responsible innovation, inclusive economic growth, and sustainability. We work closely with governments, industries, and communities to ensure that technology serves the public good and helps address some of the world’s most pressing challenges.

As we celebrate our 50th anniversary in 2025, we continue to look forward, investing in AI, cloud, and quantum computing to shape the future of work, education, and society at large.
