High Performance Computing (HPC) Systems Architect
Job description
- Manage, monitor, and maintain the new HPC cluster, including compute, GPU, high-memory, hybrid, and management nodes.
- Work with researchers and data scientists in an academic, public health, or scientific research context.
- Oversee and optimize the Slurm job scheduler, including configuration, policies, queues, troubleshooting user jobs, and performance tuning.
- Operate and support Tier 1 storage (PixStore) and its integration with Tier 2 storage (Dell EMC Isilon/PowerScale).
- Act as a technical liaison between IT and the research community, supporting researchers and data scientists with onboarding, software configuration, and workload optimization.
- Translate research needs into practical workflows on the cluster, providing guidance on best practices for running jobs and managing data.
- Develop user-facing documentation, quick-start guides, and FAQs for researchers and data scientists.
- Deliver trainings, workshops, and onboarding sessions to help users learn command-line basics, use scientific tools, and manage jobs via Slurm.
- Collaborate with internal teams, faculty liaisons, and external vendors on support, enhancements, and long-term planning for HPC services.
This position follows a hybrid work model, typically three days on-site and two days remote, with some flexibility. The candidate must be available to come on-site as needed for data center access, physical hardware issues, and vendor visits. The schedule consists of standard daytime hours, with some flexibility required for urgent issues affecting research workloads.
Requirements
An organization is deploying a new High Performance Computing (HPC) cluster and seeks an HPC-focused professional to administer, support, and enable research workloads on this environment. The role centers on HPC environment management and research support: the successful candidate acts as the primary owner and advocate for the HPC environment and its users within a team that has strong Linux system administration expertise. Previous experience as a data scientist, or in mentoring data scientists, is preferred.
Experience: Hands-on experience with High Performance Computing (HPC) environments is required, not just standalone Linux servers. Experience working with researchers or data scientists, ideally in an academic, public health, or scientific research context, is necessary. The candidate must be comfortable working on-site with physical hardware and data center environments.
Technical Skills: Strong Linux experience, particularly in a server or cluster environment, is required. Practical experience with job schedulers, specifically Slurm (configuration, job submission, troubleshooting, and optimization), is essential. The role requires the ability to understand and support software commonly used in research/HPC, such as Python-based workflows and scientific libraries, and to communicate technical concepts clearly to non-experts.
Preferred Qualifications
- A strong background as a data scientist or in a research computing support role.
- Experience with PixStore or similar high-performance storage systems.
- Experience with Dell EMC Isilon / PowerScale or other large-scale NAS platforms.
- Familiarity with Bash, shell scripting, and scientific Python ecosystems.
- Prior experience designing or managing HPC clusters and delivering user training.
- Experience in higher education, healthcare, or public health research environments.