AWS Cloud Reliability Engineer
Job description
The AWS Cloud Reliability Engineer is responsible for technical strategies focused on strengthening and modernizing enterprise recovery capabilities within an enterprise AWS ecosystem.
This individual will collaborate with infrastructure, security, compliance, and application teams to improve operational resilience, streamline recovery execution, and support cloud governance initiatives related to performance, availability, and utilization optimization.
- Design, implement, and support AWS-based resiliency and recovery solutions across enterprise applications and services.
- Develop, maintain, and continuously improve disaster recovery and cutover runbooks including detailed recovery procedures, system dependencies, stakeholder communications, validation checkpoints, escalation paths, and rollback processes to support recovery and failover events.
- Coordinate and execute disaster recovery exercises, failover testing, and restoration activities while documenting outcomes and driving corrective improvements.
- Build reusable IaC components and operational standards using Terraform, CloudFormation, and related automation technologies.
- Create automated deployment, provisioning, and support workflows through scripting and orchestration tools.
- Enhance operational visibility by implementing monitoring, alerting, and reporting related to backup integrity, replication status, and recovery readiness.
- Partner with governance, security, and risk stakeholders to ensure resiliency solutions align with internal controls and compliance expectations.
Vaco by Highspring and its parents, affiliates, and subsidiaries ("we," "our," or "Vaco by Highspring") respect your privacy and are committed to providing transparent notice of our policies.
- California residents may access Vaco by Highspring HR Notice at Collection for California Applicants and Employees here.
- Virginia residents may access our state specific policies here.
- Residents of all other states may access our policies here.
- Canadian residents may access our policies in English here and in French here.
- Residents of countries governed by GDPR may access our policies here.
Requirements
The role requires hands-on AWS engineering coupled with operational readiness, automation, and recovery planning. The ideal candidate brings strong AWS expertise, practical infrastructure-as-code experience, and the ability to build and deploy repeatable recovery processes across complex distributed environments.
- Minimum of 5 years of experience in cloud engineering, infrastructure operations, DevOps, SRE, or related technical environments, including hands-on support of AWS production platforms and resiliency initiatives.
- Strong understanding of disaster recovery operations including resiliency planning, failover testing, recovery validation, operational response, dependency management, and continuous improvement practices.
- Experience supporting cloud infrastructure technologies including networking, storage, compute, backup, replication, monitoring, and identity management services within AWS environments.
- Advanced experience with Infrastructure-as-Code and automation technologies including Terraform, CloudFormation, scripting, and workflow orchestration using Python, PowerShell, Bash, or similar tools.
- Ability to build scalable, repeatable operational processes and translate technical resiliency strategies into measurable business continuity and risk reduction outcomes.
- Experience supporting cloud governance, utilization analysis, tagging strategy, reporting, and cost optimization initiatives aligned with FinOps methodologies and operational efficiency goals.
- Familiarity with monitoring, observability, and logging platforms such as CloudWatch, Splunk, Datadog, or related technologies.
- Working knowledge of Linux and Windows administration, Git-based source control, and CI/CD tooling.
- Experience operating within highly regulated environments where compliance, audit readiness, and operational controls are critical. Familiarity with governance and resiliency frameworks such as NIST, ISO, ITIL, or similar standards is preferred.
- Exposure to containerized and orchestration technologies including Docker, ECS, EKS, and Kubernetes is considered a plus.
- Strong written and verbal communication skills with the ability to collaborate effectively across distributed technical, operational, and business teams, including during high-pressure recovery events or incident response situations.
- Bachelor's degree in Computer Science, Information Technology, Engineering, or equivalent practical experience preferred.
- Engagement Status: Applicant must be currently authorized to work in the U.S. without the need for employment-based sponsorship, now or in the future, including: H-1B, L-1, TN, O-1, E-3, H-1B1, F-1, J-1, OPT, CPT, or any other employment-based visa programs.
Benefits & conditions
- 401(k)
- Health insurance
- Vision insurance
- Dental insurance
Determining compensation for this role (and others) at Vaco by Highspring depends upon a wide array of factors, including but not limited to:
- the individual's skill sets, experience and training;
- licensure and certification requirements;
- office location and other geographic considerations;
- other business and organizational needs.
With that said, as required by local law, Vaco by Highspring believes that the salary range referenced above reasonably estimates the base compensation for an individual hired into this position in geographies that require salary range disclosure. The individual may also be eligible for discretionary bonuses.