SENIOR CLOUD RELIABILITY & TELEMETRY ENGINEER

Covalent Solutions, LLC
Washington, United States of America
4 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior
Compensation
$165K

Job location

Washington, United States of America

Tech stack

Adobe Analytics
Web Accessibility
Agile Methodologies
Systems Engineering
Cloud Computing
Cloud Engineering
Computer Engineering
Monitoring of Systems
Reliability Engineering
Web Content Accessibility Guidelines
Datadog
Data Logging
Enterprise Software Applications
Cloud Platform System
System Availability
MTTR
Backend
Information Technology
Legacy Systems

Job description

Covalent Solutions, LLC (Covalent) is seeking a highly vigilant and proactive Senior Cloud Reliability & Telemetry Engineer to serve as the "Reliability Orchestrator" for our Federal Government client's cloud environment. Moving beyond traditional operations engineering, you will be the ultimate guardian of day-to-day platform availability, cloud cost-efficiency, and system integrity. In this role, you will actively monitor the environment to detect configuration drift, build telemetry loops that bridge modern cloud setups with legacy systems, and automate standard operational workflows. As a core technical anchor, you will provide critical, hands-on engineering support to both external infrastructure vendors and internal development teams whenever issues surface, ensuring total operational visibility and automated incident response.

Location: Remote eligible; however, candidates local to the Washington, D.C., Maryland, or Virginia (DMV) area are strongly preferred. Personnel must be available to travel to the client's Washington, D.C. offices as needed and as directed by leadership to attend critical meetings, technical exchanges, or collaborative sessions.

The Senior Cloud Reliability & Telemetry Engineer will provide the reliability engineering, automated remediation, and visibility services necessary to safeguard the federal enterprise cloud platform. Key responsibilities include, but are not limited to:

  • Platform Operations & High Availability: Maintain the high availability, scalability, resilience, and target uptime baseline (99.5%+) of the pilot cloud platform, internal automation tooling, and foundational environment frameworks.
  • Telemetry & System Auditing: Build, optimize, and maintain standardized logging, monitoring, and auditing dashboards across all organizational cloud accounts to provide total environment visibility and satisfy strict federal compliance frameworks (NIST, FedRAMP, FISMA).
  • Drift Detection & Proactive Remediation: Continuously monitor cloud environments to identify infrastructure configuration drift. Document variances and engineer automated remediation workflows or technical playbooks to guide external infrastructure vendors in restoring baselines.
  • Rapid Security Vulnerability Fixes: Act as the rapid-response operational anchor to provide hands-on technical support, updated deployment reference patterns, and source-code fixes within 72 hours of validating critical security gaps or insecure implementation patterns.
  • Legacy System Modernization Support: Apply customized monitoring, logging, alerting, and metrics-gathering utilities to integrate legacy enterprise software safely into modern, unified cloud visibility views.
  • Financial Operations: Perform routine resource utilization reviews to identify, track, and share quarterly cloud cost-optimization and right-sizing recommendations with vendor teams and senior stakeholders.
  • Cross-Vendor Collaborative Engineering: Actively participate in cross-vendor governance meetings and integrated planning cadences to review standards adoption, mitigate operational risks, handle on-call production needs, and prevent project velocity slowdowns.
  • Accessibility Compliance: Ensure all operational dashboards, reporting templates, and internal-facing interfaces developed by the team fully comply with applicable Section 508 and WCAG 2.1 AA accessibility standards, executing remediation for any identified issues within 30 days.

Requirements

  • Do you have experience in Tooling?
  • Do you have a Bachelor's degree?
  • Education: Bachelor's degree in Computer Science, Cloud Computing, Computer Engineering, or a related technical field (equivalent practical systems engineering or military IT experience will be considered). Master's degree preferred.
  • Years of Experience: Minimum of 6 years of progressive professional experience in site reliability engineering, cloud infrastructure operations, or systems administration.
  • Automation Production Background: At least 2 years of direct experience supporting, scaling, and configuring automation, deployment, and monitoring tooling within a production federal or highly regulated cloud platform environment.
  • Observability Tooling Expertise: Advanced, hands-on capability with modern cloud monitoring frameworks, centralized logging stacks, alerting engines, and usage telemetry configuration to track error budgets and incident MTTR.
  • Incident & Drift Automation: Proven experience using code to automate standard system monitoring tasks, configuration drift identification, and operational self-healing recovery paths.
  • Blended Systems Engineering: Strong technical problem-solving and troubleshooting skills across complex, hybrid federal environments containing both modern cloud infrastructure and multi-vendor legacy backend sources.
  • Agile Environment: Proven experience working within an outcome-driven, fast-paced Agile software development environment using capacity-based team models.
  • Product-Centric Delivery Alignment: Demonstrated success collaborating directly with product managers and cross-functional teams to continuously align technical infrastructure delivery with user needs, strategic product backlogs, and operational priorities.
  • Accountability: Takes full ownership of all platform operations and performance commitments, consistently delivering high-quality outputs within prescribed standards while ensuring personal responsibility for environment compliance.
  • Multi-Project Dependency Coordination: Exceptional organizational and coordination skills with the technical capability to balance multiple moving pieces, operational tickets, and system constraints simultaneously without losing sight of architectural details.
  • Communication & Technical Coaching: Strong verbal and written communication skills, with a natural ability to translate complex system logs into actionable executive dashboards for leadership or external vendors, while actively mentoring peers.
  • Judgment & Problem Solving: Ability to work independently with minimal oversight, exercise sound technical judgment during high-severity incidents, and resolve complex process-based or tooling roadblocks proactively.
  • Mission-First & User-Centered Mindset: A service-oriented, highly collaborative approach to supporting federal partners and external stakeholders, focusing on the overall benefit of the digital product to maximize delivery velocity.
  • Are you a US Citizen or US permanent resident?
  • Are you willing to undergo the process of obtaining a US Public Trust Clearance?
  • Do you have 4+ years of cloud automation production professional experience?
  • Do you have 4+ years of professional hands-on experience with modern cloud monitoring frameworks, centralized logging stacks, alerting engines, and usage telemetry configuration to track error budgets and incident MTTR?
  • Do you have 4+ years of professional hands-on experience using code to automate standard system monitoring tasks, configuration drift identification, and operational self-healing recovery paths?
  • Do you have 4+ years of professional hands-on experience troubleshooting and solving technical problems across complex, hybrid federal environments containing both modern cloud infrastructure and multi-vendor legacy backend sources?
  • Do you have 4+ years of professional hands-on experience working within an outcome-driven, fast-paced Agile software development environment using capacity-based team models?
  • Do you have 2+ years of professional experience collaborating directly with product managers and cross-functional teams to continuously align technical infrastructure delivery with user needs, strategic product backlogs, and operational priorities?

Education:

  • Bachelor's (Required)

Experience:

  • Cloud Engineering: 6 years (Required)

Language:

  • English (Required)

Benefits & conditions

  • 401(k)
  • Health insurance
  • Paid time off
  • Vision insurance
  • Dental insurance
  • Life insurance
