Engineering Manager, HPC Platform

GTN Technical Staffing
Dallas, United States of America
4 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Intermediate

Job location

Dallas, United States of America

Tech stack

Artificial Intelligence
Computing Platforms
Linux
Distributed Data Store
InfiniBand
Machine Learning
Open Source Technology
Performance Tuning
Ansible
Prometheus
Scientific Computing
Systems Architecture
Systems Integration
AI Infrastructure
Large Language Models
Grafana
Kubernetes
Bare Metal
Slurm
Machine Learning Operations
Terraform

Job description

We are seeking an Engineering Manager, HPC Platform to lead the design, scaling, and operational excellence of a bare-metal Kubernetes platform powering HPC, AI/ML workloads, and next-generation CaaS / GPUaaS environments.

This organization operates at the forefront of high-performance computing and AI infrastructure, building platforms that support large-scale research, simulation, and production workloads. This role will lead a team responsible for delivering multi-tenant, GPU-accelerated compute platforms, enabling GPU-as-a-Service (GPUaaS) and Container-as-a-Service (CaaS) capabilities across distributed data center environments.

This is a hands-on leadership role focused on platform performance, reliability, and automation. You will define the technical roadmap, guide system architecture, and ensure the platform delivers high-throughput, low-latency performance at scale for distributed HPC and AI workloads.

Key Responsibilities

Leadership & Team Development

  • Lead, mentor, and grow a team of engineers building and scaling HPC and Kubernetes-based platform infrastructure
  • Foster a culture of ownership, operational excellence, and continuous improvement
  • Drive alignment across engineering, platform, and infrastructure teams

Platform Architecture & Engineering

  • Architect and scale a bare-metal Kubernetes platform supporting HPC, AI/ML, and CaaS / GPUaaS workloads
  • Design and optimize multi-tenant GPU and CPU environments, including workload isolation, scheduling, and resource management
  • Define architecture patterns for high-performance, distributed compute platforms

GPU Platform & Workload Optimization

  • Optimize GPU utilization, scheduling, and performance across large-scale clusters
  • Support AI/ML training, LLM workloads, and scientific computing at scale
  • Ensure efficient workload orchestration across Kubernetes and HPC scheduling environments

Automation, SRE & Platform Operations

  • Drive automation using Infrastructure-as-Code (Terraform, Ansible) and CI/CD pipelines
  • Implement SRE best practices for reliability, observability, and incident response
  • Build scalable operational frameworks supporting large, multi-tenant compute environments

Performance, Reliability & Capacity Planning

  • Own platform performance, uptime, and scalability across thousands of nodes
  • Define and track KPIs for system health, utilization, and performance
  • Lead capacity planning and forecasting aligned with rapid compute growth

Cross-Functional Collaboration

  • Partner with research, storage, and networking teams to integrate distributed filesystems and high-speed interconnects (InfiniBand, RoCE)
  • Collaborate with hardware and software vendors to improve platform capabilities and deployment efficiency
  • Align platform architecture with evolving HPC, AI, and GPUaaS / CaaS delivery models

Requirements

  • 7+ years of experience in infrastructure, platform, or SRE engineering, with 2+ years in a technical leadership role
  • Proven experience operating Kubernetes environments for HPC, AI/ML, or GPU-accelerated workloads
  • Experience designing or supporting CaaS, GPUaaS, or multi-tenant compute platforms
  • Deep expertise in Linux systems, networking, and performance optimization on bare-metal infrastructure
  • Experience managing large-scale distributed clusters and integrating storage and high-performance networking
  • Strong experience with automation tools (Terraform, Ansible) and observability platforms (Prometheus, Grafana, Loki)
  • Strong communication and leadership skills with the ability to translate technical direction into execution

Preferred Experience

  • Familiarity with HPC schedulers (Slurm, Flux) and hybrid scheduling models
  • Experience with container runtimes (containerd, CRI-O) and Kubernetes internals
  • Contributions to open-source Kubernetes, HPC, or ML infrastructure projects
  • Experience operating in hyperscale or AI-focused infrastructure environments

Additional Requirements

  • This position requires applicants to be currently authorized to work in the U.S. without employer sponsorship.
  • We are unable to sponsor or take over sponsorship of employment visas at this time.
