Executive Director, AI Infrastructure & Platform Engineering
Job description
The Executive Director, AI Infrastructure & Platform Engineering is a senior engineering leadership role responsible for standing up, operating, and continuously improving CVS Health's on-premises AI compute platform. This position owns the physical and platform layers of CVS's Enterprise AI Factory - a frontier-class GPU compute environment running NVIDIA Blackwell systems across a high-throughput RoCE v2 fabric, hosted in co-located data center facilities, with multi-site expansion underway.
Reporting to the Global Head of Infrastructure/AI Operations and Service Delivery, this leader will establish operational baselines across the full infrastructure stack - hardware, network fabric, GPU clusters, storage, and the operating systems and orchestration layers above - and build the Site Reliability Engineering practice that delivers the availability, reliability, and performance that frontier AI workloads demand.
This is a greenfield organizational build. The Executive Director will define the operating model, set the engineering standards, hire and develop the team, and establish the long-term operations capability that will govern CVS's AI infrastructure for years to come.
Strategy and Leadership:
- Define and execute the long-range vision and strategy for AI infrastructure and platform engineering, with availability (>99.99%), reliability, and platform performance as the primary measures of success.
- Recruit, hire, develop, and retain a high-performing engineering organization spanning infrastructure, network, platform reliability, observability, security, 24/7 operations, change and release management, and FinOps.
- Establish clear ownership, accountability, and performance expectations across all functional teams; foster a culture of operational excellence, engineering rigor, and continuous improvement.
- Provide executive-level communication to senior leadership on platform status, milestones, risk posture, and strategic initiatives.
Infrastructure and Platform Engineering:
- Own the physical layer of the AI compute environment - GPU compute, storage, network fabric, capacity planning, and hardware lifecycle accountability.
- Direct bare-metal Kubernetes and OpenShift operations, including cluster administration, GPU quota governance, infrastructure-as-code adoption, and availability baseline enforcement.
- Govern high-performance network fabric operations - RoCE v2, spine-leaf topology, lossless Ethernet tuning, congestion management, and segmentation.
- Establish and enforce operational baselines across every layer of the stack - hardware, fabric, platform, and workload - with deviations detected, escalated, and resolved within defined SLAs.
- Direct Innovation POD strategy to develop self-healing and autonomous capabilities that prevent service degradation before it impacts availability.
Operations and Reliability:
- Build and sustain a high-performing 24/7 operations model - designed for sustainable, predictable coverage with no mandatory overtime and measurable team health and retention.
- Drive end-to-end observability across the physical and platform layers, with continuous feedback loops connecting monitoring data to incident response, change decisions, and improvement cycles.
- Oversee change management so every modification is risk-assessed, monitored during rollout, and baseline-validated post-deployment.
- Ensure configuration consistency and drift detection across all platform components to prevent baseline degradation over time.
- Lead GPU FinOps governance - utilization optimization, tenant quota enforcement, and cost reduction - in partnership with the Finance organization.
Security and Compliance:
- Empower the Security SRE Lead to maintain a world-class security posture across the infrastructure and platform layers, with robust compliance with frameworks including HIPAA and NIST AI RMF.
- Govern access controls, audit logging, vulnerability management, and network segmentation across the AI compute environment.
Program Transition and Operating Model:
- Lead the operational transition from program-launch staffing to permanent CVS-owned operations - governing phased handoffs, competency validation, and milestone sign-offs to ensure minimal disruption to platform availability and business operations.
- Establish and lead the long-term operating model by institutionalizing key technical, architectural, and delivery leadership capabilities into permanent CVS roles, ensuring the organization is fully self-sustaining at program close.
Vendor and Stakeholder Management:
- Own vendor relationships, contract performance, and accountability across the hardware, networking, platform, and managed-services stack.
- Manage budget ownership for the AI infrastructure and platform engineering organization, including capital planning and operational expense governance.
Requirements
The successful candidate will demonstrate technical depth, executive presence, and a proven record of operating physical infrastructure at data center scale. The ideal candidate will bring the following experience, knowledge, and abilities:
- 10+ years of engineering leadership experience, with substantial time directly owning physical infrastructure at data center scale - including hardware lifecycle, capacity planning, and facility coordination (power, cooling, rack-and-stack execution).
- Hands-on production ownership of bare-metal Kubernetes or OpenShift. Managed cloud services (EKS, GKE, AKS) alone do not substitute for the practitioner expertise this role requires.
- Fluency with high-speed cluster fabrics - RoCE v2, InfiniBand, EVPN-VXLAN, or carrier-grade equivalent - and the operational discipline these fabrics require (PFC, ECN, lossless tuning, congestion management).
- 5+ years leading multiple technical teams simultaneously, including 24/7 operations organizations, with measurable team health, retention, and performance outcomes.
- Proven success establishing and enforcing operational baselines, SLO / SLI / error-budget frameworks, and observability-driven continuous improvement in physical-infrastructure-anchored environments.
- Hardware lifecycle, vendor accountability, and facility coordination experience - including capacity planning, RMA management, and multi-vendor escalation.
- Experience leading operational transitions or organizational build-outs at scale, with business continuity and minimal disruption as non-negotiables.
- Executive-level stakeholder communication, vendor negotiation, and budget ownership.

- Hands-on experience with Cisco UCS, NVIDIA HGX / DGX / Blackwell systems, and VAST or comparable distributed NVMe storage.
- Direct experience operating GPU clusters of 32 or more GPUs in production environments - including HPC, AI training, research computing, or comparable workloads.
- NVIDIA AI Enterprise, NVIDIA Run:AI, NVIDIA Base Command Manager, or comparable GPU orchestration platform experience.
- Healthcare or other regulated-industry background (HIPAA, NIST AI RMF, SOX, FedRAMP, ITAR).
- Chaos engineering and AI-driven operations experience - predictive alerting and automated remediation patterns.
- Background in innovation programs, POD structures, or centers of excellence.

- Required: Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, or related technical field.