HPC Systems Engineer
Job description
We're looking for an HPC Systems Engineer to help power the compute infrastructure behind our R&D innovation! In this role, you'll support and evolve a high-performance Linux cluster used for physics modeling, simulation, algorithm development, and machine-learning workloads, enabling hundreds of engineers to do their best work every day. You'll play a key role in driving the reliability, performance, and scalability of a shared, mission-critical HPC environment, partnering closely with infrastructure, DevOps, and application teams to keep the platform fast, resilient, and ready for the most demanding computational challenges!
HPC Platform Operations
- Operate and maintain a large-scale Linux-based HPC cluster used for internal R&D workloads
- Manage compute nodes, login nodes, and supporting infrastructure in a multi-tenant environment
- Monitor cluster health, performance, and capacity; respond to incidents and degradations
Scheduler & Workload Management
- Configure, tune, and support HPC job schedulers (e.g., SLURM, LSF, PBS, or equivalent)
- Assist users with job submission issues, resource requests, and queue optimization
- Help optimize scheduler policies to balance throughput, fairness, and utilization
Linux Systems Engineering
- Install, configure, and maintain Linux operating systems across compute and service nodes
- Manage OS updates, kernel changes, drivers (including GPU drivers where applicable), and system hardening
- Troubleshoot complex Linux performance, networking, storage, and process-level issues
Performance & Scaling
- Support high-throughput and parallel workloads across CPU and GPU resources
- Diagnose performance bottlenecks across compute, storage, network, and scheduler layers
- Assist with scaling activities such as node expansions, reprovisioning, and hardware refreshes
Automation & Reliability
- Use automation and configuration management tools to ensure consistency across the cluster
- Contribute to scripting and tooling for node provisioning, validation, and lifecycle management
- Participate in on-call or escalation rotations as required to support a production R&D platform
Collaboration & User Support
- Partner with internal engineering teams to understand workload requirements and usage patterns
- Provide guidance and best practices for running workloads efficiently on shared HPC systems
- Contribute to internal documentation and operational runbooks
Requirements
- Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
- 3+ years of hands-on Linux systems administration experience
- Direct experience working with HPC or large-scale compute environments
- Practical experience with at least one HPC scheduler (SLURM, LSF, PBS, or similar)
- Strong Linux troubleshooting skills (processes, memory, I/O, networking, performance analysis)
- Comfort working in CLI-driven, production infrastructure environments
Preferred:
- Experience supporting GPU-accelerated workloads (CUDA, drivers, GPU scheduling concepts)
- Familiarity with parallel computing or scientific/engineering workloads
- Experience with cluster storage systems (e.g., Lustre, BeeGFS, NFS, or high-performance NAS/SAN)
- Exposure to automation tools (Ansible, scripting, Infrastructure-as-Code...
For full info, follow the application link.