HPC Technical Consultant, Onsite (LANL) Los Alamos, NM

Hewlett Packard Enterprise
Santa Fe, United States of America
2 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Shift work
Languages
English
Experience level
Junior
Compensation
$ 188K

Job location

Santa Fe, United States of America

Tech stack

Microsoft Word
Microsoft Excel
Microsoft Windows
Artificial Intelligence
Apple Mac Systems
Intelligent Platform Management Interface
Bash
BIOS
Microsoft Outlook
Command-Line Interface
Software Documentation
CompTIA Network+
CompTIA Security+
Computer Networks
Data Centers
Software Debugging
Microprocessors
Disk Arrays
File Systems
Ethernet
Network Interface Controllers
Firmware
Issue Tracking Systems
InfiniBand
Python
Network Troubleshooting
Linux System Administration
Log Analysis
Log Files
Microsoft Office
Performance Tuning
Productivity Software
Server Administration
SharePoint
Syslog
Scripting (Bash/Python/Go/Ruby)
High Performance Computing
Peripherals
CompTIA Server+
Computer Equipment
Information Technology
Slack
Hardware Infrastructure
CompTIA Linux+
REST
Network Server

Job description

  • Monitor and maintain system health across large-scale HPC compute, network, and storage infrastructure
  • Troubleshoot and repair hardware issues on HPC servers and supporting systems
  • Perform basic Linux system administration tasks as needed
  • Create, monitor, update, and close support tickets
  • Perform hardware component replacements using spares
  • Operate hand tools and low-power tools for server maintenance
  • Track and document hardware repairs, part replacements, and returns
  • Create, update, and maintain site documentation, processes, and workflows
  • Assist with new system installation and expansion activities
  • Read system documentation and diagrams to locate components
  • Collaborate with team members using email, Teams, Slack, and in-person communication
  • Participate in on-call schedule to support 24x7 operations
  • Maintain tools and workspace in an organized manner

Requirements

Candidates must meet all of the following requirements:

  • Ability to obtain a Q Clearance (required)
  • US Citizenship (required)
  • Must be able to work onsite 5 days per week in Los Alamos, NM, with additional onsite work for on-call support. This is not a remote position.
  • Strong mechanical aptitude and comfort using common hand tools (screwdrivers, pliers, wrenches, cable tools, etc.) for assembling, disassembling, and maintaining server hardware and related equipment
  • Ability to lift up to 50 lbs individually and up to 75 lbs with assistance
  • Solid understanding of computer hardware components (servers, drives, memory modules, power supplies, cabling, and peripherals)
  • Proficiency with basic computer operations on Windows and macOS (MacBook), including OS navigation, file management, and standard productivity tools such as Slack, SharePoint, Microsoft Office (Word, Excel, Outlook, and Teams)

Preferred Qualifications

A combination of the following is preferred:

  • Associate's degree, some college, or technical training (BS preferred)
  • 2+ years of Linux system administration experience, including strong command-line navigation, log analysis and monitoring (journalctl, syslog, log files), troubleshooting system and application issues, and scripting/automation using Bash or Python.
  • Experience using Redfish (along with IPMI) for out-of-band server hardware management and monitoring. This includes utilizing the Redfish RESTful API for querying system health, power/thermal monitoring, firmware inventory, component status (processors, memory, drives, NICs), event logs, and performing actions such as system resets, power control, and BIOS configuration.
  • 2+ years of hands-on experience troubleshooting and maintaining server hardware in a datacenter environment, including diagnosing hardware faults (power, thermal, storage, networking), performing component replacements (drives, memory, CPUs, PSUs, HBAs, NICs), rack mounting/decommissioning servers, and managing cable infrastructure
  • 1+ year of experience with high-speed networking concepts and troubleshooting for Ethernet, HPE Slingshot, and InfiniBand fabrics, including link diagnostics, performance tuning, cable/fiber management, switch configuration, and fault isolation in large-scale HPC environments.
  • Previous experience in a 24x7 production support environment
  • Strong troubleshooting and problem-solving skills with the ability to work independently, including systematically diagnosing complex hardware, software, and network issues through log analysis, debugging tools, and root cause analysis while minimizing downtime in high-availability environments
  • Experience reading technical diagrams, schematics, and working with ticketing systems
  • Experience with Git for version control of code, scripts, configuration files, and documentation (including cloning, branching, committing, merging, and resolving conflicts)
  • Experience with High-Performance Computing (HPC) systems, clusters, or large-scale AI infrastructure
  • Experience with large-scale storage systems, including installation, configuration, monitoring, and troubleshooting of parallel file systems, enterprise SAN/NAS solutions, object storage, and high-capacity disk arrays.

Highly Desired Industry Certifications (any of the following):

  • CompTIA Linux+
  • CompTIA Security+
  • CompTIA Server+
  • CompTIA A+
  • CompTIA Network+
  • ITIL Foundation

Benefits & conditions

The expected salary/wage range for this position is provided below. Actual offer may vary from this range based upon geographic location, work experience, education/training, and/or skill level.

  • United States of America: Annual Salary USD 81,500 - 187,500 in New Mexico. The listed salary range reflects base salary. Variable incentives may also be offered.

About the company

This role has been designed as 'Onsite' with an expectation that you will primarily work from an HPE partner/customer office.

Hewlett Packard Enterprise is the global edge-to-cloud company advancing the way people live and work. We help companies connect, protect, analyze, and act on their data and applications wherever they live, from edge to cloud, so they can turn insights into outcomes at the speed required to thrive in today's complex world. Our culture thrives on finding new and better ways to accelerate what's next. We know varied backgrounds are valued and succeed here. We have the flexibility to manage our work and personal needs. We make bold moves, together, and are a force for good. If you are looking to stretch and grow your career our culture will embrace you. Open up opportunities with HPE.

Join a dedicated on-site team supporting operations and hardware maintenance for HPE supercomputers in one of the nation's premier High-Performance Computing facilities.
