Senior Site Reliability Engineer
Job description
The Senior Site Reliability Engineer is a technical leader responsible for architecting the reliability strategy for large-scale, distributed government systems. You will lead the implementation of the SRE framework, driving the adoption of SLO-based management and advanced automation. As a subject matter expert, you will mentor mid-level engineers and interface with government stakeholders to ensure system resilience and performance meet mission requirements.
- Reliability Architecture: Define the strategy for Service Level Objectives (SLOs) and Error Budgets. Design complex telemetry pipelines for full-stack observability.
- Strategic Automation: Design and govern the enterprise Infrastructure as Code (IaC) standards. Develop custom tooling to automate complex recovery procedures and system scaling.
- Incident Command: Act as the Incident Commander for major system outages, leading the technical response and directing the Root Cause Analysis (RCA) process.
- Security & Compliance: Lead the integration of security-as-code within DevSecOps pipelines, ensuring full compliance with RMF and NIST 800-53 standards.
- Mentorship: Provide technical guidance and mentorship to Mid-Level SREs and developers, fostering a culture of reliability across the organization.
Requirements
- 7+ years of experience in SRE or DevOps, with significant experience in distributed systems.
- Expertise in Go, Python, or Java and advanced knowledge of Linux internals.
- Extensive experience managing production Kubernetes environments and complex cloud architectures.
- Proven track record of defining and meeting SLOs for high-availability systems.
- Experience navigating government Risk Management Framework (RMF) processes.
- Education: Bachelor's or Master's degree in Computer Science or Engineering.
- Certifications: CKA (Certified Kubernetes Administrator) and an industry observability certification preferred.