HPC Linux Storage Engineer
Job description
- Design and Management of Infrastructure: Architect, deploy, and manage large-scale storage systems and HPC platforms to support research, scientific, and enterprise workloads. Develop and implement solutions for structured, unstructured, and archival data storage, focusing on scalability, reliability, and performance.
- Systems Analysis and Development: Apply systems analysis techniques to consult with users/customers, determine functional requirements, and design, test, or optimize storage and computational solutions tailored to their needs. Develop, document, and modify solutions, including system prototypes and automated workflows, to enhance operational efficiency.
- Performance, Optimization, and Troubleshooting: Ensure the performance, availability, scalability, and security of diverse infrastructure environments. Diagnose and resolve complex operational challenges quickly and effectively, applying advanced performance optimization techniques for a wide range of workloads.
- Collaboration and Best Practices: Work closely with stakeholders from research, technical, and operational teams to understand workflows, identify opportunities for improvement, and deliver effective solutions. Define, implement, and enforce best practices, standards, and procedures across projects and teams.
- Automation and Innovation: Automate system configuration, provisioning, monitoring, and maintenance to reduce manual effort and downtime. Evaluate emerging technologies and tools to continuously improve system capabilities, adapt to changing needs, and plan for future advancements.
- Support and Maintenance: Support critical infrastructure through participation in a 24/7 on-call rotation and off-hours maintenance windows. Resolve hardware and software issues in coordination with vendors, ensuring minimal impact on operations.

- Work on the world's most powerful supercomputers, including Frontier, the first system to achieve exascale performance.
- Enable breakthrough science in fields like fusion energy, climate modeling, AI, and national security.
- Collaborate with diverse teams of scientists, engineers, and technologists from across the DOE complex and academia.
- Grow your career in a mission-driven, innovation-focused environment with access to professional development and leadership opportunities.
- Enjoy life in East Tennessee, with a thriving research community, scenic outdoor recreation, and a high quality of life.
This position will remain open for a minimum of 5 days, after which it will close when a qualified candidate is identified and/or hired.
We accept Word (.doc, .docx), Adobe (unsecured .pdf), Rich Text Format (.rtf), and HTML (.htm, .html) files up to 5 MB in size. Resumes from third-party vendors will not be accepted; these resumes will be deleted, and the candidates submitted will not be considered for employment.
Requirements
- Bachelor's degree in computer science, engineering, information technology, or a related field, and at least 5 years of professional experience managing Linux/UNIX systems in heterogeneous environments. An equivalent combination of education and experience will be considered.
- Demonstrated experience with high-performance computing (HPC) storage systems and enterprise storage platforms (e.g., Lustre, GPFS, BeeGFS, or WEKA).
- Proficiency in scripting languages (e.g., Python, Bash, Perl) and configuration management, automation, and version control tools (e.g., Ansible, Puppet, Git).
- Strong communication, collaboration, and problem-solving skills with the ability to design and implement solutions independently.

- Active DOE Q, DoD Top Secret, or TS/SCI clearance.
- Hands-on experience with HPC cluster technologies, including job schedulers (e.g., Slurm) and system deployment tools (e.g., Warewulf, PXE boot, Bright Cluster Manager).
- Expertise in high-performance parallel file systems, tape library systems, and storage and networking technologies (e.g., RAID, ZFS, NVMe-oF, InfiniBand).
- Familiarity with performance monitoring tools (e.g., Grafana, Nagios), benchmarking systems, and I/O optimization techniques.
- Experience with virtualization and containerization platforms (e.g., VMware, KVM, Podman, Apptainer).
- Background in open source development, including submitting patches upstream, and building custom Linux packages (e.g., RPM for RHEL).
- Demonstrated ability to troubleshoot and optimize high-performance storage, compute, and networking systems in HPC environments.
- Experience documenting technical processes and contributing to complex technical projects in government, scientific, or highly technical settings.
Hybrid Eligibility
These positions are located in Oak Ridge, Tennessee, and require onsite presence. We offer a flexible work environment that supports both the organization and the employee. A hybrid/onsite working arrangement may be available with this position, which provides flexibility to work periodically from home while reporting onsite to the Oak Ridge, Tennessee location on a regular weekly basis.
Special Requirement
This position requires the ability to obtain and maintain clearance from the Department of Energy. As such, this position is a Workplace Substance Abuse Program (WSAP) testing-designated position. WSAP positions require passing a pre-placement drug test and participation in an ongoing random drug testing program.