Site Reliability Engineer
Job description
The Site Reliability Engineer will focus on the execution and maintenance of reliability engineering practices for mission-critical government systems. Following the SRE Implementation Plan, you will bridge the gap between development and operations by applying a software engineering mindset to system administration. You will be responsible for building automation, maintaining CI/CD pipelines, and ensuring system health through robust monitoring.
- Monitoring & Observability: Implement and maintain dashboards and alerting rules using Prometheus, Grafana, or ELK Stack. Support the identification of Service Level Indicators (SLIs).
- Automation: Develop and maintain Infrastructure as Code (IaC) scripts using Terraform and Ansible to ensure repeatable, error-free deployments.
- CI/CD Management: Maintain automated deployment pipelines, ensuring security scans and automated tests are integrated into the workflow.
- Incident Response: Participate in on-call rotations and assist in troubleshooting system outages. Contribute to blameless post-mortem reports to drive continuous improvement.
- Toil Reduction: Identify repetitive manual tasks and develop automation to reduce "toil," allowing the team to focus on high-value engineering.
Requirements
Do you have experience in NIST standards?
Do you have a Bachelor's degree?
- 3-5 years of experience in SRE, DevOps, or Systems Engineering roles.
- Proficiency in scripting languages (Python, Go, or Bash).
- Hands-on experience with containerization (Docker, Kubernetes) and cloud platforms (AWS, Azure, or GCP).
- Familiarity with NIST SP 800-53 security controls.
- Education: Bachelor's degree in Computer Science or a related technical field.