HPC Systems Engineer

Job Cloud Inc.
yesterday

Role details

Contract type
Temporary contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Remote

Tech stack

Link Aggregation (Ethernet)
Systems Engineering
Command-Line Interface
Linux
File Systems
Distributed Systems
Firmware
Linux System Administration
NetApp Applications
Network administration
Virtual Local Area Networks
High Performance Computing
Low Latency
Slurm
SnapMirror

Job description

The HPC Systems Engineer supports storage, networking, GPU systems, and compute environments, ensuring system performance, availability, and reliability while troubleshooting issues and supporting users.

Storage Administration (NetApp)

  • Administer NetApp storage systems (volumes, aggregates, qtrees, snapshots)
  • Manage replication technologies (SnapMirror, SnapVault)
  • Monitor storage performance (I/O, latency, capacity) and report on trends
  • Troubleshoot storage issues impacting HPC workloads
  • Maintain backup, recovery, and data protection policies

Network Administration (Arista)

  • Configure and maintain Arista switches within HPC environments
  • Manage VLANs, ACLs, and link aggregation
  • Support network documentation, topology diagrams, and change management

NVIDIA DGX & GPU Systems

  • Support NVIDIA DGX systems including health checks, driver updates, and OS maintenance
  • Monitor GPU utilization, thermal performance, and interconnects (DCGM, nvidia-smi)
  • Troubleshoot and escalate hardware or performance issues

HPC Operations

  • Perform system health checks, patching, and firmware updates on HPE servers
  • Support HPC schedulers such as Slurm or PBS (queue monitoring, job troubleshooting)

Requirements

  • 3 to 5 years of Linux systems administration or HPC infrastructure experience
  • Experience supporting GPU-based systems (NVIDIA preferred)
  • Strong command-line troubleshooting across distributed systems
  • Solid communication and documentation skills
  • Preferred: Advanced Linux experience (7 to 10+ years)
  • Preferred: Experience with Slurm or similar schedulers
  • Preferred: Exposure to HPCM or parallel file systems
  • Preferred: Familiarity with NICE DCV or similar HPC tools
