HPC Systems Engineer

KLA-Tencor Corporation

Ann Arbor, United States of America

2 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Intermediate

Job location

Ann Arbor, United States of America

Tech stack

Algorithm Design

Systems Engineering

Computer Clusters

Nvidia CUDA

Linux

DevOps

Job Scheduling

Linux System Administration

Ansible

Scripting (Bash/Python/Go/Ruby)

Parallel Computation

Infrastructure Automation Frameworks

Information Technology

Slurm

Job description

We're looking for an HPC Systems Engineer to help power the compute infrastructure behind our R&D innovation! In this role, you'll support and evolve a high-performance Linux cluster used for physics modeling, simulation, algorithm development, and machine-learning workloads, enabling hundreds of engineers to do their best work every day. You'll play a key role in driving the reliability, performance, and scalability of a shared, mission-critical HPC environment, partnering closely with infrastructure, DevOps, and application teams to keep the platform fast, resilient, and ready for the most demanding computational challenges!

HPC Platform Operations

Operate and maintain a large-scale Linux-based HPC cluster used for internal R&D workloads
Manage compute nodes, login nodes, and supporting infrastructure in a multi-tenant environment
Monitor cluster health, performance, and capacity; respond to incidents and degradations

Scheduler & Workload Management

Configure, tune, and support HPC job schedulers (e.g., SLURM, LSF, PBS, or equivalent)
Assist users with job submission issues, resource requests, and queue optimization
Help optimize scheduler policies to balance throughput, fairness, and utilization

Linux Systems Engineering

Install, configure, and maintain Linux operating systems across compute and service nodes
Manage OS updates, kernel changes, drivers (including GPU drivers where applicable), and system hardening
Troubleshoot complex Linux performance, networking, storage, and process-level issues

Performance & Scaling

Support high-throughput and parallel workloads across CPU and GPU resources
Diagnose performance bottlenecks across compute, storage, network, and scheduler layers
Assist with scaling activities such as node expansions, re-provisioning, and hardware refreshes

Automation & Reliability

Use automation and configuration management tools to ensure consistency across the cluster
Contribute to scripting and tooling for node provisioning, validation, and lifecycle management
Participate in on-call or escalation rotations as required to support a production R&D platform

Collaboration & User Support

Partner with internal engineering teams to understand workload requirements and usage patterns
Provide guidance and best practices for running workloads efficiently on shared HPC systems
Contribute to internal documentation and operational runbooks

Requirements

Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
3+ years of hands-on Linux systems administration experience
Direct experience working with HPC or large-scale compute environments
Practical experience with at least one HPC scheduler (SLURM, LSF, PBS, or similar)
Strong Linux troubleshooting skills (processes, memory, I/O, networking, performance analysis)
Comfort working in CLI-driven, production infrastructure environments

Preferred:

Experience supporting GPU-accelerated workloads (CUDA, drivers, GPU scheduling concepts)
Familiarity with parallel computing or scientific/engineering workloads
Experience with cluster storage systems (e.g., Lustre, BeeGFS, NFS, or high-performance NAS/SAN)
Exposure to automation tools (Ansible, scripting, Infrastructure-as-Code...

For full info follow application link.

About the company

Company Overview

KLA is a global leader in diversified electronics for the semiconductor manufacturing ecosystem. Virtually every electronic device in the world is produced using our technologies. No laptop, smartphone, wearable device, voice-controlled gadget, flexible screen, VR device or smart car would have made it into your hands without us. KLA invents systems and solutions for the manufacturing of wafers and reticles, integrated circuits, packaging, printed circuit boards and flat panel displays. The innovative ideas and devices that are advancing humanity all begin with inspiration, research and development. KLA places an above-average focus on innovation, and we invest 15% of sales back into R&D. Our expert teams of physicists, engineers, data scientists and problem-solvers work together with the world's leading technology providers to accelerate the delivery of tomorrow's electronic devices. Life here is exciting and our teams thrive on tackling really hard problems. There is never a dull moment with us.

Group/Division

With over 40 years of semiconductor process control experience, chipmakers around the globe rely on KLA to ensure that their fabs ramp next-generation devices to volume production quickly and cost-effectively. Enabling the movement towards advanced chip design, KLA's Global Products Group (GPG), which is responsible for creating all of KLA's metrology and inspection products, is looking for the best and the brightest research scientists, software engineers, application development engineers, and senior product technology process engineers.

Central Engineering is KLA's largest engineering organization, comprised of 9 Centers-of-Excellence (CoE) in various disciplines applied across all product groups in the company. These CoE include Handling & Automation, Precision Motion Control, Sensors & Image Acquisition, Platform Design, and Packaging Engineering, among others. Talent includes over 500 engineers across global centers in Israel, China, India, and the US. Each CoE contributes not just talent and deliverables per discipline toward product programs, but also subject matter expertise, best practices, roadmaps, specialized facilities, apparatus, models, and analytics. These differentiate KLA not only in WHAT we do, but also in HOW we do it.
