HPC Linux Storage Engineer

Oak Ridge National Laboratory
Oak Ridge, United States of America
1 month ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Oak Ridge, United States of America

Tech stack

Microsoft Word
HTML
Artificial Intelligence
Bash
Configuration Management
System Configuration
Linux
RAID
File Systems
Perl
General Parallel File Systems
InfiniBand
Systems Analysis
Storage Area Network (SAN)
Job Scheduling
Python
Kernel-Based Virtual Machine
Nagios
Open Source Technology
Performance Tuning
Red Hat Enterprise Linux - RHEL
Ansible
Rich Text Format
Software Engineering
Supercomputing
Tape Libraries
Virtualization Technology
Weka
Scripting (Bash/Python/Go/Ruby)
Data Storage Technologies
Grafana
Git
Adobe
Containerization
Information Technology
Performance Monitor
Data Management
Slurm
ZFS File System
Puppet
Docker
NVMe
VMware

Job description

  • Design and Management of Infrastructure: Architect, deploy, and manage large-scale storage systems and HPC platforms to support research, scientific, and enterprise workloads. Develop and implement solutions for structured, unstructured, and archival data storage, focusing on scalability, reliability, and performance.
  • Systems Analysis and Development: Apply systems analysis techniques to consult with users/customers, determine functional requirements, and design, test, or optimize storage and computational solutions tailored to their needs. Develop, document, and modify solutions, including system prototypes and automated workflows, to enhance operational efficiency.
  • Performance, Optimization, and Troubleshooting: Ensure the performance, availability, scalability, and security of diverse infrastructure environments. Diagnose and resolve complex operational challenges quickly and effectively, applying advanced performance optimization techniques for a wide range of workloads.
  • Collaboration and Best Practices: Work closely with stakeholders from research, technical, and operational teams to understand workflows, identify opportunities for improvement, and deliver effective solutions. Define, implement, and enforce best practices, standards, and procedures across projects and teams.
  • Automation and Innovation: Automate system configuration, provisioning, monitoring, and maintenance to reduce manual effort and downtime. Evaluate emerging technologies and tools to continuously improve system capabilities, adapt to changing needs, and plan for future advancements.
  • Support and Maintenance: Support critical infrastructure through participation in a 24/7 on-call rotation and off-hours maintenance windows. Resolve hardware and software issues in coordination with vendors, ensuring minimal impact on operations.

  • Work on the world's most powerful supercomputers, including Frontier, the first system to achieve exascale performance.
  • Enable breakthrough science in fields like fusion energy, climate modeling, AI, and national security.
  • Collaborate with diverse teams of scientists, engineers, and technologists from across the DOE complex and academia.
  • Grow your career in a mission-driven, innovation-focused environment with access to professional development and leadership opportunities.
  • Enjoy life in East Tennessee, with a thriving research community, scenic outdoor recreation, and a high quality of life.

This position will remain open for a minimum of 5 days, after which it will close when a qualified candidate is identified and/or hired.

We accept Word (.doc, .docx), Adobe (unsecured .pdf), Rich Text Format (.rtf), and HTML (.htm, .html) files up to 5 MB in size. Resumes from third-party vendors will not be accepted; these resumes will be deleted and the candidates submitted will not be considered for employment.

Requirements

  • Bachelor's degree in computer science, engineering, information technology, or a related field; and at least 5 years of professional experience managing Linux/UNIX systems in heterogeneous environments. An equivalent combination of education and experience will be considered.
  • Demonstrated experience with high-performance computing (HPC) storage systems and enterprise storage platforms (e.g., Lustre, GPFS, BeeGFS, or WEKA).
  • Proficiency in scripting languages (e.g., Python, Bash, Perl) and configuration management/automation tools (e.g., Ansible, Puppet, Git).
  • Strong communication, collaboration, and problem-solving skills with the ability to design and implement solutions independently.

  • Active DOE Q, DoD Top Secret, or TS/SCI clearance.
  • Hands-on experience with HPC cluster technologies, including job schedulers (e.g., Slurm) and system deployment tools (e.g., Warewulf, PXE boot, Bright Cluster Manager).
  • Expertise in high-performance parallel file systems, tape library systems, and storage networking technologies (e.g., RAID, ZFS, NVMe-oF, InfiniBand).
  • Familiarity with performance monitoring tools (e.g., Grafana, Nagios), benchmarking systems, and I/O optimization techniques.
  • Experience with virtualization and containerization platforms (e.g., VMware, KVM, Podman, Apptainer).
  • Background in open source development, including submitting patches upstream, and building custom Linux packages (e.g., RPM for RHEL).
  • Demonstrated ability to troubleshoot and optimize high-performance storage, compute, and networking systems in HPC environments.
  • Experience documenting technical processes and contributing to complex technical projects in government, scientific, or highly technical settings.

Hybrid Eligibility

These positions are located in Oak Ridge, Tennessee, and require onsite presence. We offer a flexible work environment that supports both the organization and the employee. A hybrid/onsite working arrangement may be available with this position, which provides flexibility to work periodically from your home while reporting onsite to the Oak Ridge, Tennessee, location on a regular weekly basis.

Special Requirement

This position requires the ability to obtain and maintain clearance from the Department of Energy. As such, this position is a Workplace Substance Abuse Program (WSAP) testing-designated position. WSAP positions require passing a pre-placement drug test and participation in an ongoing random drug testing program.

About the company

Oak Ridge National Laboratory (ORNL), home to some of the world's most powerful supercomputers, is seeking highly skilled professionals to support large-scale storage systems, high-speed parallel file systems, and archival solutions critical to advancing scientific discovery and innovation. As part of ORNL's leadership-class computing ecosystem, you will play a vital role in designing, deploying, optimizing, and maintaining infrastructure that powers cutting-edge research across diverse scientific domains. This evergreen posting represents multiple opportunities across ORNL's high-performance computing (HPC) environment, supporting scalable, reliable, and secure computing and storage capabilities. Applications are reviewed on an ongoing basis as new positions become available to meet the dynamic needs of our world-class computing facility.

As a U.S. Department of Energy (DOE) Office of Science national laboratory, ORNL has an impressive 80-year legacy of addressing the nation's most pressing challenges. Our team is made up of over 7,000 dedicated and innovative individuals! Our goal is to create an environment where a variety of perspectives and backgrounds are valued, ensuring ORNL is known as a top choice for employment. These principles are essential for supporting our broader mission to drive scientific breakthroughs and translate them into solutions for energy, environmental, and security challenges facing the nation.
