HPC Systems Engineer 1

Boise State University
Boise, United States of America
yesterday

Role details

Contract type
Permanent contract
Employment type
Part-time (≤ 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Junior
Compensation
$ 62K

Job location

Boise, United States of America

Tech stack

Amazon Web Services (AWS)
Systems Engineering
Audit Trail
Bash
Command-Line Interface
CMake
Configuration Management
Software Documentation
Nvidia CUDA
Data Infrastructure
ETL
Linux
File Systems
Distributed Data Store
GitHub
General Parallel File Systems
General-Purpose Computing on Graphics Processing Units
Monitoring of Systems
Issue Tracking Systems
InfiniBand
Knowledge Management
Linux System Administration
Linux Commands
Log Analysis
Package Management Systems
Cadence Virtuoso
Ansible
Runbook
Server Administration
Backup and Restore
Ceph
Curam Configuration Tools
Scripting (Bash/Python/Go/Ruby)
Cloud Platform System
High Performance Computing
System Availability
Parallel Computation
GitLab
Git
Containerization
Infrastructure Automation Frameworks
Slurm
Software Version Control
User Accounts

Job description

This position will provide high-performance computing support for Idaho's academic research community. The position is located at the Collaborative Computing Center (C3) in Idaho Falls, ID on the campus of Idaho National Lab (INL). The position is responsible for server administration, evaluation, planning, configuration, and installation, as well as the deployment, tuning, and troubleshooting of multiple large-scale HPC environments. On-call rotations may be required.

60% of the time the HPC Systems Engineer 1 must:

  • Administer production HPC systems supporting university researchers, research centers, and sponsored projects, maintaining system availability, integrity, and performance
  • Evaluate and recommend cluster-based solutions for research workloads, including capacity planning for compute, storage, backup and restore, and data movement requirements
  • Collaborate with research computing staff and scientific support specialists to support researchers across diverse domains and application stacks
  • Manage user accounts, allocations, and group memberships; provide first- and second-tier support for job submission, scheduler troubleshooting, storage quotas, and Linux command-line assistance
  • Install, build, and maintain scientific software, from vendor-supplied packages to user-requested builds from source, using Spack and manual compilation as appropriate
  • Triage and resolve technical issues spanning the Linux OS, networking, parallel filesystems, interconnects, and clustered scientific applications
  • Create, update, and close tickets in the research computing ticketing system according to established service standards
  • Develop and maintain system documentation, runbooks, knowledge base articles, and user-facing training materials using wikis and knowledge management tools
  • Maintain inventory and lifecycle tracking of HPC equipment, including procurement support, receiving, decommissioning, and warranty records
  • Participate in on-call rotation for after-hours maintenance windows and service-down scenarios

35% of the time the HPC Systems Engineer 1 must:

  • Configure and maintain Slurm scheduling systems, including partitions, accounts, QoS, fairshare, and preemption policies
  • Operate and extend cluster provisioning systems (Warewulf) and environment module systems (Lmod)
  • Manage and extend Open OnDemand deployments and interactive computing interfaces for research users
  • Develop and maintain infrastructure-as-code using configuration management tools (Ansible, Chef) and version-controlled repositories
  • Implement and monitor system performance, utilization, and health using monitoring and metrics platforms; analyze results and implement tuning or capacity changes
  • Support implementation of System Security Plans, including access controls, patching cadence, audit logging, and compliance documentation for regulated data environments where applicable
  • Propose, maintain, and enforce operational policies, practices, and security procedures in coordination with OIT and the security team

5% of the time the HPC Systems Engineer 1 must:

  • Perform other duties as assigned

Requirements

Entry-level professional with limited or no prior experience to contribute on a project or work team. Incumbent learns to use professional concepts to resolve problems of limited scope and complexity under close supervision while achieving day-to-day objectives. Works on developmental assignments that are initially routine in nature, requiring limited judgment and decision making. This level is typically focused on self-development. Requires theoretical knowledge through specific education and training.

  • Familiarity with HPC cluster architecture, including compute nodes, high-speed interconnects (InfiniBand or Omni-Path), parallel filesystems (Lustre, GPFS, BeeGFS), and distributed storage systems (Ceph, NFS)
  • Familiarity with HPC workload managers, particularly Slurm, including job submission, partition and QoS configuration, and fairshare scheduling
  • Familiarity with cluster provisioning and configuration tools such as Warewulf, Bright, or equivalent stateless/stateful provisioning systems
  • Familiarity with environment module systems (Lmod, Environment Modules) and user-facing HPC portals such as Open OnDemand
  • Familiarity with cloud computing platforms and hybrid HPC deployments, particularly AWS services for research computing (EC2, S3, ParallelCluster)

Linux and scripting

  • Demonstrated proficiency with Linux system administration, including command-line tools, package management, system service management, user and permissions management, file systems, and log analysis
  • Demonstrated proficiency in shell (Bash) and Python scripting for automation, with the ability to develop, deploy, and schedule scripts in production environments
  • Experience installing and maintaining scientific applications on Linux, including building from source with autotools, CMake, and scientific software stack managers (Spack)

Parallel computing software stack

  • Experience with HPC compiler toolchains (GCC, Intel oneAPI, NVIDIA HPC SDK) and MPI implementations (OpenMPI, MPICH, Intel MPI)
  • Familiarity with GPU computing ecosystems (CUDA Toolkit) and workload scheduling of GPU resources
  • Exposure to containerized HPC workloads using Apptainer or equivalent container technologies

Configuration management and version control

  • Experience with version control using Git, including branching workflows and collaborative platforms (GitHub, GitLab)
  • Familiarity with configuration management and automation tools such as Ansible or Chef for infrastructure-as-code practices

Professional skills

  • Strong verbal and written communication skills, with the ability to translate technical concepts for research users across a wide range of technical backgrounds
  • Ability to manage, prioritize, and make progress on multiple concurrent projects with minimal supervision
  • Customer-service orientation when supporting a broad research user community with patience and professionalism
  • Problem-solving and critical thinking skills to diagnose and resolve complex technical issues methodically

Bachelor's Degree or equivalent experience.

Benefits & conditions

Starting salary is $66,705.60 annually and is commensurate with experience. Boise State University provides a best-in-class benefits package, including (but not limited to):

  • 12 paid holidays AND the University is closed between Christmas and New Year's (requires use of 3 vacation days)
  • 12 to 24 annual paid vacation days for full-time Professional and Classified staff, depending on position type and years of service
  • 10.76% University contribution to your ORP retirement fund (Professional and Faculty employees)
  • 11.96% University contribution to your PERSI retirement fund (Classified employees)
  • Excellent medical, dental and other health-related insurance coverages
  • Tuition fee waiver benefits for employees, spouses and their dependents

$110,500.00 - $130,000.00 per year

About the company

Research Computing is advancing research at Boise State through innovative technical partnerships and grant development to support a robust cyberinfrastructure. Check it out: https://www.boisestate.edu/rcs

Nestled along the Boise River and steps from the state capitol, Boise State University fosters a vibrant and welcoming academic environment that fuels student and employee success. We're a trailblazing institution, nationally recognized for our innovative spirit and commitment to positive impact on Idaho and beyond. Boise State is proud to be recognized by Forbes as the only Idaho employer listed in the top 100 of all national midsize and large employers. We're building a thriving community of faculty and staff whose unique skills, experiences, and perspectives come together to create a rich and rewarding academic experience. Applications from all backgrounds are welcomed. Learn more about Boise State and living in Idaho's Treasure Valley at https://www.boisestate.edu/about
