HPC Kubernetes/Slurm Cluster Engineer
Job description
We are seeking a highly skilled HPC/AI/ML Cluster Engineer to support the design, deployment, and ongoing operations of large-scale HPC environments powered by Kubernetes and Slurm (Slinky). This role centers on cluster engineering, administration, and performance optimization, with emphasis on GPU-accelerated computing, advanced networking, and workload scheduling. In this role, you will work closely with our researchers, vendors, and partners to manage Slurm clusters that are used for AI/ML workloads. You will work alongside our team to support in-house, partner, and customer infrastructure.
Cluster Engineering & Deployment
- Participate in the design and bring-up of bare-metal HPC/AI/ML environments.
- Integrate heterogeneous hardware platforms into cohesive scheduling environments.
- Develop provisioning and imaging workflows (Ansible, MAAS, cloud-init, Warewulf, CI/CD pipelines) for reproducible cluster build-out.
- Coordinate communications between vendors, researchers, and other partners during cluster bring-up and operation.
Slurm Management
- Configure and operate the Slurm Workload Manager.
- Build custom Slurm plugins and scripts (epilog/prolog, pam_slurm_adopt) to extend functionality and integrate with authentication, health checking, and monitoring.
- Manage federated Slurm setups across multi-site or hybrid cloud environments.
System Administration & Monitoring
- Administer Linux HPC environments, including network configuration, storage integration, and kernel tuning for HPC workloads.
- Deploy and maintain observability stacks for system health, GPU metrics, and job monitoring.
- Automate failure detection, node health checks, and job cleanup to ensure high uptime and reliability.
- Manage security and access control (LDAP/SSSD, VPN, PAM, SSH session auditing).
User & Stakeholder Support
- Assist cluster users with developing workflows that make efficient use of compute resources.
- Containerize HPC applications with Docker/Podman/Enroot-Pyxis and integrate GPU-aware runtimes into Slurm jobs.
- Automate cost accounting and cluster usage reporting.
Requirements
Do you have experience in Linux administration?
- Previous experience in HPC cluster administration and engineering, with deep knowledge of Slurm.
- Expert in Slurm configuration, partition design, QoS/preemption policies, and GRES GPU scheduling.
- Strong background in Linux system administration, networking, and performance tuning for HPC environments.
- Hands-on experience with parallel file systems, advanced networking (InfiniBand, RoCE, 100/200 GbE), and monitoring stacks.
- Proficient with automation tools (Ansible, Terraform, CI/CD pipelines) and version control.
- Demonstrated ability to operate GPU-accelerated clusters at scale.
- Previous experience managing Kubernetes deployments.
- Exceptional candidates have familiarity with common AI/ML software package dependencies and researcher workflows.
- Linux and HPC cluster system administration: 1 year (Required)
Language:
- English (Required)
Work Location: Hybrid remote in Hickory, NC 28602
Benefits & conditions
- 401(k)
- Health insurance
- 401(k) matching
- Relocation assistance