HPC Systems Administrator I or II

University of Utah Health
Salt Lake City, United States of America
2 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Intermediate
Compensation
$77K

Job location

Salt Lake City, United States of America

Tech stack

Intelligent Platform Management Interface
Bash
Computer Clusters
System Configuration
Data Centers
Linux
File Systems
Text Processing
IBM Hardware Management Console
Monitoring of Systems
InfiniBand
IP Addressing
Python
Linux kernel
Linux Commands
Networking Basics
Package Management Systems
Ansible
Scripting (Bash/Python/Go/Ruby)
Graphics Processing Unit (GPU)
High Performance Computing
Grafana
Firewalls (Computer Science)
GIT
Containerization
Kubernetes
Slurm
Software Version Control
Docker

Job description

Under the direction of the HPC Systems Team Lead, you will join a collaborative team of specialists. In this role, you will work closely with senior mentors to develop skills in High Performance Computing (HPC) systems administration. You will assist in supporting cluster and workload orchestration systems within a complex environment, focusing on compute clusters, host-side networking, and storage connectivity.

Work Environment

  • Schedule: Primarily Monday-Friday during standard business hours; requires occasional flexibility for after-hours maintenance and urgent system remediation.
  • Primary Work Environment: Standard office environment for administrative tasks, system monitoring, and team collaboration.
  • On-site Engagement: Regular Data Center access as dictated by infrastructure requirements and hardware deployment cycles.
  • Physical Requirements: Ability to move and position server equipment weighing up to 50 lbs. into rack systems, with or without reasonable accommodation. (Note: server lifts and team-lifting protocols are standard practice for heavier or awkward loads.)

Responsibilities

  1. HPC Systems Operations & Maintenance (60%)
  • Cluster Administration: Assist in the provisioning, configuration, and maintenance of Linux cluster nodes (CPU and GPU) and the drivers required for hardware functionality and storage connectivity.
  • Hardware Support: Write and follow structured procedures to troubleshoot hardware failures on cluster nodes (including accelerators/GPUs). Work with vendors to request RMAs and coordinate component replacements.
  • Host-Side Networking: Support basic host-side connectivity and driver configuration for high-speed networks (InfiniBand and RoCE v2) under the direction of senior engineers.
  • Documentation & Tracking: Maintain and contribute to communal technical documentation. Ensure system health tracking logs and hardware lifecycle records are kept current to support team-wide visibility and historical analysis.
  • Monitoring: Utilize HPC community tools to check and validate system health. Collaborate with the CHPC monitoring team to implement metrics and visualization using tools like Grafana.
  2. Specialized User Support (25%)
  • Customer Service: Provide professional, clear, and timely communication while troubleshooting issues to ensure researcher success.
  • Triage: Act as a first point of contact for researcher support tickets specifically related to HPC system hardware and batch/scheduling systems.
  • Documentation: Participate in the documentation of system configurations, deployment processes, and support procedures. Contribute to user-facing documentation to ensure accurate knowledge transfer.
  3. Skill Development (15%)
  • Mentorship: Participate in structured training and mentorship to learn advanced batch scheduling (Slurm), parallel file systems (Lustre), and security compliance (HIPAA, NIST 800-171).
  • Future Technologies: As part of our commitment to innovation, you will have opportunities to assist senior staff in exploring and piloting emerging workload orchestration models beyond traditional batch, such as Kubernetes (K8s), Flux, or Slinky.

Requirements

EQUIVALENCY STATEMENT: 1 year of higher education can be substituted for 1 year of directly related work experience (Example: bachelor's degree = 4 years of directly related work experience).

Systems Administrator, I: Requires a bachelor's (or equivalency) + 2 years of directly related work experience or a master's (or equivalency) degree.

Systems Administrator, II: Requires a bachelor's (or equivalency) + 4 years or a master's (or equivalency) + 2 years of directly related work experience.

  • Recent graduates with strong academic or project-based Linux experience are encouraged to apply.
  • Linux Proficiency: Demonstrated ability to navigate the Linux command line (file system navigation, text editing, permissions, package management).
  • Scripting Fundamentals: Basic proficiency in a scripting language (Bash or Python) to automate simple administrative tasks.
  • Networking Fundamentals: Understanding of basic networking concepts (IP addressing, SSH, ports, firewalls).

  • Analytical Problem-Solving: Ability to troubleshoot technical issues and resolve challenges through a mix of critical thinking and following detailed instructions in a complex environment.

Preferences

  • Hardware Management: Familiarity with out-of-band management protocols (IPMI, Redfish) or vendor-specific tools (e.g., Dell iDRAC, HP iLO, Lenovo XClarity).
  • Linux Internals: Familiarity with Linux cgroups or container technologies (Apptainer/Docker).
  • Config Management: Experience with Ansible.
  • HPC Context: Prior exposure to High Performance Computing environments.
  • Version Control: Experience with Git.

The University of Utah values candidates who have experience working in settings with students and possess a strong commitment to improving access to higher education.
