HPC Kubernetes/Slurm Cluster Engineer

Fluid Numerics, LLC

Hickory, United States of America

2 months ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Junior

Compensation

$ 140K

Job location

Hickory, United States of America

Tech stack

Artificial Intelligence

Continuous Integration

Linux

File Systems

InfiniBand

Virtual Private Networks (VPN)

Lightweight Directory Access Protocol (LDAP)

Linux System Administration

Network Configuration and Change Management

Performance Tuning

Ansible

Enterprise Software Applications

High Performance Computing

Hybrid Cloud

Containerization

Kubernetes

Bare Metal

Slurm

Terraform

Software Version Control

Docker

Job description

We are seeking a highly skilled HPC/AI/ML Cluster Engineer to support the design, deployment, and ongoing operation of large-scale HPC environments powered by Kubernetes and Slurm (Slinky). This role centers on cluster engineering, administration, and performance optimization, with emphasis on GPU-accelerated computing, advanced networking, and workload scheduling. You will work closely with our researchers, vendors, and partners to manage Slurm clusters used for AI/ML workloads, and alongside our team to support in-house, partner, and customer infrastructure.

Cluster Engineering & Deployment

Participate in the design and bring-up of bare metal HPC/AI/ML environments.
Integrate heterogeneous hardware platforms into cohesive scheduling environments.
Develop provisioning and imaging workflows (Ansible, MAAS, cloud-init, Warewulf, CI/CD pipelines) for reproducible cluster build-out.
Coordinate communications between vendors, researchers, and other partners during cluster bring-up and operation.

Slurm Management

Configure and operate the Slurm Workload Manager.
Build custom Slurm plugins and scripts (epilog/prolog, pam_slurm_adopt) to extend functionality and integrate with authentication, health checking, and monitoring.
Manage federated Slurm setups across multi-site or hybrid cloud environments.

System Administration & Monitoring

Administer Linux HPC environments, including network configuration, storage integration, and kernel tuning for HPC workloads.
Deploy and maintain observability stacks for system health, GPU metrics, and job monitoring.
Automate failure detection, node health checks, and job cleanup to ensure high uptime and reliability.
Manage security and access control (LDAP/SSSD, VPN, PAM, SSH session auditing).

User & Stakeholder Support

Assist cluster users with developing workflows that make efficient use of compute resources.
Containerize HPC applications with Docker/Podman/Enroot-Pyxis and integrate GPU-aware runtimes into Slurm jobs.
Automate cost accounting and cluster usage reporting.

Requirements

Do you have experience in Linux administration?

Previous experience in HPC cluster administration and engineering, with deep knowledge of Slurm.
Expert in Slurm configuration, partition design, QoS/preemption policies, and GRES GPU scheduling.
Strong background in Linux system administration, networking, and performance tuning for HPC environments.
Hands-on experience with parallel file systems, advanced networking (InfiniBand, RoCE, 100/200 GbE), and monitoring stacks.
Proficient with automation tools (Ansible, Terraform, CI/CD pipelines) and version control.
Demonstrated ability to operate GPU-accelerated clusters at scale.
Previous experience managing Kubernetes deployments.
Exceptional candidates have familiarity with common AI/ML software package dependencies and researcher workflows.

Linux and HPC cluster system administration: 1 year (Required)

Language:

English (Required)

Work Location: Hybrid remote in Hickory, NC 28602

Benefits & conditions

401(k)
Health insurance
401(k) matching
Relocation assistance
