HPC Systems Administrator I or II

University of Utah Health

Salt Lake City, United States of America

2 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Intermediate

Compensation

$77K

Job location

Salt Lake City, United States of America

Tech stack

Intelligent Platform Management Interface

Bash

Computer Clusters

System Configuration

Data Centers

Linux

File Systems

Text Processing

IBM Hardware Management Console

Monitoring of Systems

InfiniBand

IP Addressing

Python

Linux kernel

Linux Commands

Networking Basics

Package Management Systems

Ansible

Scripting (Bash/Python/Go/Ruby)

Graphics Processing Unit (GPU)

High Performance Computing

Grafana

Firewalls (Computer Science)

GIT

Containerization

Kubernetes

Slurm

Software Version Control

Docker

Job description

Under the direction of the HPC Systems Team Lead, you will join a collaborative team of specialists. In this role, you will work closely with senior mentors to develop skills in High Performance Computing (HPC) systems administration. You will assist in supporting cluster and workload orchestration systems within a complex environment, focusing on compute clusters, host-side networking, and storage connectivity.

Work Environment

Schedule: Primarily Monday-Friday during standard business hours; requires occasional flexibility for after-hours maintenance and urgent system remediation.
Primary Work Environment: Standard office environment for administrative tasks, system monitoring, and team collaboration.
On-site Engagement: Regular Data Center access as dictated by infrastructure requirements and hardware deployment cycles.
Physical Requirements: Ability to move and position server equipment weighing up to 50 lbs into rack systems, with or without reasonable accommodation (note: server lifts and team-lifting protocols are standard practice for heavier or awkward loads).

Responsibilities

HPC Systems Operations & Maintenance (60%)

Cluster Administration: Assist in the provisioning, configuration, and maintenance of Linux cluster nodes (CPU and GPU) and the drivers required for hardware functionality and storage connectivity.
Hardware Support: Write and follow structured procedures to troubleshoot hardware failures on cluster nodes (including accelerators/GPUs). Work with vendors to request RMAs and coordinate component replacements.
Host-Side Networking: Support basic host-side connectivity and driver configuration for high-speed networks (InfiniBand and RoCE v2) under the direction of senior engineers.
Documentation & Tracking: Maintain and contribute to communal technical documentation. Ensure system health tracking logs and hardware lifecycle records are kept current to support team-wide visibility and historical analysis.
Monitoring: Utilize HPC community tools to check and validate system health. Collaborate with the CHPC monitoring team to implement metrics and visualization using tools like Grafana.

Specialized User Support (25%)

Customer Service: Provide professional, clear, and timely communication while troubleshooting issues to ensure researcher success.
Triage: Act as a first point of contact for researcher support tickets specifically related to HPC system hardware and batch/scheduling systems.
Documentation: Participate in the documentation of system configurations, deployment processes, and support procedures. Contribute to user-facing documentation to ensure accurate knowledge transfer.

Skill Development (15%)

Mentorship: Participate in structured training and mentorship to learn advanced batch scheduling (Slurm), parallel file systems (Lustre), and security compliance (HIPAA, NIST 800-171).
Future Technologies: As part of our commitment to innovation, you will have opportunities to assist senior staff in exploring and piloting emerging workload orchestration modes beyond traditional batch, such as Kubernetes (K8s), Flux, or Slinky.

Requirements

EQUIVALENCY STATEMENT: 1 year of higher education can be substituted for 1 year of directly related work experience (Example: bachelor's degree = 4 years of directly related work experience).

Systems Administrator, I: Requires a bachelor's (or equivalency) + 2 years of directly related work experience or a master's (or equivalency) degree.

Systems Administrator, II: Requires a bachelor's (or equivalency) + 4 years or a master's (or equivalency) + 2 years of directly related work experience.

Recent graduates with strong academic or project-based Linux experience are encouraged to apply.
Linux Proficiency: Demonstrated ability to navigate the Linux command line (file system navigation, text editing, permissions, package management).
Scripting Fundamentals: Basic proficiency in a scripting language (Bash or Python) to automate simple administrative tasks.
Networking Fundamentals: Understanding of basic networking concepts (IP addressing, SSH, ports, firewalls).

Analytical Problem-Solving: Ability to troubleshoot technical issues and resolve challenges through a mix of critical thinking and following detailed instructions in a complex environment.

Preferences

Hardware Management: Familiarity with out-of-band management protocols (IPMI, Redfish) or vendor-specific tools (e.g., Dell iDRAC, HP iLO, Lenovo XClarity).
Linux Internals: Familiarity with Linux cgroups or container technologies (Apptainer/Docker).
Config Management: Experience with Ansible.
HPC Context: Prior exposure to High Performance Computing environments.
Version Control: Experience with Git.

The University of Utah values candidates who have experience working in settings with students and possess a strong commitment to improving access to higher education.