Release/Incident Operations Engineer

Everforth ECS

Fairfax, United States of America

1 month ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Intermediate

Job location

Fairfax, United States of America

Tech stack

CompTIA Cloud+

Artificial Intelligence

Cloud Computing

CompTIA Security+

Information Systems

Continuous Integration

Data Infrastructure

Elasticsearch

Machine Learning

Prometheus

Grafana

GitLab

GitLab CI

Kubernetes

Cisco networks

VMware

Job description

Everforth ECS is seeking a Release/Incident Operations Engineer to work in the National Capital Region, covering the Pentagon, Falls Church, and Fairfax. Please note: this position is contingent upon contract award.

The War Data Platform (WDP) is a key initiative within the U.S. Department of War's (DoW) AI-First strategy introduced in early 2026. The WDP focuses on operational warfighting data and aims to accelerate the deployment of artificial intelligence (AI) on the battlefield. The WDP extends to Unclassified, Secret, and Top Secret environments, and supports collaboration between Combatant Commands, Joint Staff directorates, Senior Executive Service leaders, and operational analysts.

The Release/Incident Operations Engineer coordinates release operations and incident triage support for AI and machine learning model-serving pipelines across WDP Core Integration's full multi-enclave environment, ensuring deployment consistency and operational continuity in direct support of DoW missions, Joint Staff analysts, Combatant Command elements, and Senior Executive Service leadership. This role is central to sustaining mission-ready AI model-serving performance across all classification levels through disciplined release governance, root-cause analysis, and proactive operational risk management.

Coordinates release operations for artificial intelligence and machine learning model serving across War Data Platform (WDP) Core Integration environments supporting Department of War missions, Joint Staff analysts, Combatant Command elements, and Senior Executive Service leadership.
Directs change-window execution, rollback readiness activities, and deployment governance for model-runtime updates, serving endpoints, and pipeline modifications.
Conducts incident triage support by analyzing telemetry, reviewing service health indicators, and initiating stabilization actions across Kubernetes clusters, VMware environments, GitLab Continuous Integration pipelines, Prometheus metrics, Grafana dashboards, and Elastic Stack observability tooling.
Executes root-cause analysis activities for serving incidents by collecting operational evidence, reconstructing failure sequences, validating remediation steps, and documenting corrective actions aligned with mission assurance requirements.
Maintains operational readiness for model serving by coordinating with Platform One, Cloud One, multi-national engineering teams, and cross-service mission partners to align release activities with enclave-specific constraints, cross-domain deployment architectures, and security requirements.
Produces mission-critical deliverables including release plans, rollback packages, incident triage reports, root-cause analysis documentation, operational risk assessments, and service restoration summaries.
Strengthens program value by advancing deployment consistency, reducing mission risk, and reinforcing operational continuity across all enclaves.
Supports Tier-4 incident response actions to maintain service-level agreements and sustain mission performance for enterprise artificial intelligence model-serving capabilities.
Performs other duties as assigned.

Requirements

Current Secret security clearance with the ability to obtain and maintain a Top Secret (TS) security clearance with Sensitive Compartmented Information (SCI).

3 or more years of experience in release engineering, incident operations, or platform support roles within a federal government or classified environment, including demonstrated hands-on responsibility for change-window execution, deployment governance, rollback readiness, and incident triage for AI/ML model-serving pipelines or equivalent enterprise cloud-hosted services across multi-enclave or multi-classification environments.
Hands-on experience applying enterprise observability and container orchestration tooling, including Kubernetes, GitLab CI, VMware, Prometheus, Grafana, and Elastic Stack, to diagnose serving failures, analyze pipeline telemetry, execute root-cause analysis, and coordinate stabilization activities across Unclassified, Secret, and Top Secret network environments.
Active DoW 8570/8140-compliant IAT Level II certification, such as CompTIA Security+ CE, CompTIA CySA+, CompTIA Cloud+, Cisco CCNA Security, GIAC GSEC, GIAC GCED, or ISC² SSCP, as required for access to DoW information systems.
Strong problem-solving and decision-making capabilities, with a proven ability to weigh the relative costs and benefits of potential actions and identify the most appropriate solution.
