HPC Linux Engineer
Job description
We're looking for a highly capable HPC Linux Engineer to join a small, specialist IT function supporting a demanding compute-led environment. This is a hands-on role for someone who enjoys owning infrastructure end-to-end and improving performance, reliability, and scalability across complex Linux-based systems.
The role comes with a very strong benefits package, including 28 days of annual leave, a bonus scheme, equity, and private health insurance, as well as a number of smaller benefits.
Working closely with the IT Manager and engineering teams, you'll play a key role in the ongoing development of a high-performance computing (HPC) platform, ensuring it remains secure, efficient, and fit for future growth. This role would suit an experienced Linux Systems Administrator who enjoys problem-solving in technically challenging environments and is looking for additional responsibility.
Key areas of responsibility include:
- Operating and enhancing a high-performance Linux compute platform, covering servers, storage, and associated services
- Tracking utilisation, capacity, and performance, and taking action to prevent issues before they impact users
- Introducing and refining workload management and scheduling solutions within the HPC estate
- Reducing manual effort through effective use of automation
- Contributing to infrastructure roadmaps, upgrades, and scaling decisions
- Creating and maintaining technical standards, runbooks, and system documentation
- Acting as an escalation point for complex platform-related issues
- Maintaining a strong security posture and ensuring systems align with internal policies and external requirements
Requirements
- Several years' experience supporting Linux infrastructure in compute-intensive or highly available environments
- Practical exposure to HPC platforms, including workload schedulers and distributed systems
- Strong knowledge of Red Hat-derived operating systems
- A proven track record of automating operational tasks using scripting or infrastructure-as-code tools
- A solid understanding of networking concepts as they apply to server and cluster environments
- Experience implementing or maintaining backup, recovery, and data protection solutions
- The confidence to communicate clearly with both technical and non-technical stakeholders