Lead Engineer - Software & HPC Engineering
Job description
We're seeking a Lead HPC Engineer - or an experienced Senior HPC Engineer ready to step up - to take ownership of a large-scale, high-performance computing environment.
You'll support and evolve an HPC cluster of over 10,000 cores, ensuring reliability, performance, and scalability for workloads ranging from single high-precision runs to thousands of parallel simulations.
Working within the Software & HPC Engineering team, you'll collaborate closely with computational scientists, data engineers, and IT specialists to deliver a robust platform that underpins cutting-edge research and development.
Key Responsibilities
- Maintain and optimise HPC hardware, working with external vendors where required
- Manage core system software and ensure platform stability
- Monitor performance, troubleshoot issues, and drive continuous improvements
- Oversee backups of critical data and system configurations
- Schedule and perform maintenance aligned with user activity
- Profile workloads and enhance system efficiency
- Communicate system status, updates, and major issues to stakeholders
- Capture user requirements and contribute to upgrade and capacity planning
- Support procurement processes and vendor negotiations
- Produce clear documentation for both technical teams and end users
- Collaborate across engineering and IT teams on shared infrastructure
Current Environment
You'll be working with a modern HPC stack, including:
- Large-scale multi-vendor server infrastructure (AMD EPYC, Intel Xeon)
- High-speed networking (100Gb LAN) and high-performance storage systems
- Linux-based environments (AlmaLinux, Ubuntu)
- Distributed file systems (Lustre, GlusterFS, NFS)
- HPC tooling including Slurm, Ansible, and monitoring frameworks
- Development ecosystems supporting C++, Fortran, MPI, and Python
Requirements
Essential:
- Degree in Computer Science (or equivalent experience)
- Strong expertise in Linux, HPC systems, storage, and networking
- Experience with MPI and scientific computing environments (C++, Fortran)
- Familiarity with job schedulers and workload management systems
- Scripting skills (Shell, Python) and version control (Git)
- Ability to design, implement, and support complex HPC systems
- Strong analytical thinking and problem-solving skills
- Excellent communication and collaboration abilities
Desirable:
- Deep expertise in HPC optimisation and performance profiling
- Experience with configuration management tools (e.g. Ansible)
- Knowledge of containerisation (e.g. Singularity, Apptainer)
- Experience working with secure or air-gapped environments
- Familiarity with HPC accounting systems and SQL databases
- Experience supporting and training end users
Rullion celebrates and supports diversity and is committed to ensuring equal opportunities for both employees and applicants.