Engineering Manager, HPC Platform

GTN Technical Staffing

Dallas, United States of America

4 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Intermediate

Job location

Dallas, United States of America

Tech stack

Artificial Intelligence

Computing Platforms

Linux

Distributed Data Store

InfiniBand

Machine Learning

Open Source Technology

Performance Tuning

Ansible

Prometheus

Scientific Computing

Systems Architecture

Systems Integration

AI Infrastructure

Large Language Models

Grafana

Kubernetes

Bare Metal

Slurm

Machine Learning Operations

Terraform

Job description

We are seeking an Engineering Manager, HPC Platform to lead the design, scaling, and operational excellence of a bare-metal Kubernetes platform powering HPC, AI/ML workloads, and next-generation CaaS / GPUaaS environments.

This organization operates at the forefront of high-performance computing and AI infrastructure, building platforms that support large-scale research, simulation, and production workloads. In this role, you will lead a team responsible for delivering multi-tenant, GPU-accelerated compute platforms that enable GPU-as-a-Service (GPUaaS) and Container-as-a-Service (CaaS) capabilities across distributed data center environments.

This is a hands-on leadership role focused on platform performance, reliability, and automation. You will define the technical roadmap, guide system architecture, and ensure the platform delivers high-throughput, low-latency performance at scale for distributed HPC and AI workloads.

Key Responsibilities

Leadership & Team Development

Lead, mentor, and grow a team of engineers building and scaling HPC and Kubernetes-based platform infrastructure
Foster a culture of ownership, operational excellence, and continuous improvement
Drive alignment across engineering, platform, and infrastructure teams

Platform Architecture & Engineering

Architect and scale a bare-metal Kubernetes platform supporting HPC, AI/ML, and CaaS / GPUaaS workloads
Design and optimize multi-tenant GPU and CPU environments, including workload isolation, scheduling, and resource management
Define architecture patterns for high-performance, distributed compute platforms

GPU Platform & Workload Optimization

Optimize GPU utilization, scheduling, and performance across large-scale clusters
Support AI/ML training, LLM workloads, and scientific computing at scale
Ensure efficient workload orchestration across Kubernetes and HPC scheduling environments

Automation, SRE & Platform Operations

Drive automation using Infrastructure-as-Code (Terraform, Ansible) and CI/CD pipelines
Implement SRE best practices for reliability, observability, and incident response
Build scalable operational frameworks supporting large, multi-tenant compute environments

Performance, Reliability & Capacity Planning

Own platform performance, uptime, and scalability across thousands of nodes
Define and track KPIs for system health, utilization, and performance
Lead capacity planning and forecasting aligned with rapid compute growth

Cross-Functional Collaboration

Partner with research, storage, and networking teams to integrate distributed filesystems and high-speed interconnects (InfiniBand, RoCE)
Collaborate with hardware and software vendors to improve platform capabilities and deployment efficiency
Align platform architecture with evolving HPC, AI, and GPUaaS / CaaS delivery models

Requirements

7+ years of experience in infrastructure, platform, or SRE engineering, with 2+ years in a technical leadership role
Proven experience operating Kubernetes environments for HPC, AI/ML, or GPU-accelerated workloads
Experience designing or supporting CaaS, GPUaaS, or multi-tenant compute platforms
Deep expertise in Linux systems, networking, and performance optimization on bare-metal infrastructure
Experience managing large-scale distributed clusters and integrating storage and high-performance networking
Strong experience with automation tools (Terraform, Ansible) and observability platforms (Prometheus, Grafana, Loki)
Strong communication and leadership skills with the ability to translate technical direction into execution

Preferred Experience

Familiarity with HPC schedulers (Slurm, Flux) and hybrid scheduling models
Experience with container runtimes (containerd, CRI-O) and Kubernetes internals
Contributions to open-source Kubernetes, HPC, or ML infrastructure projects
Experience operating in hyperscale or AI-focused infrastructure environments

Additional Requirements

This position requires applicants to be currently authorized to work in the U.S. without employer sponsorship.
We are unable to sponsor or take over sponsorship of employment visas at this time.