Linux Systems Engineer - HPC/AI (all genders)
Job description
We are seeking a talented Linux Systems Engineer - HPC/AI (all genders) to join a world-class team and help build and operate the foundational infrastructure needed to support groundbreaking research. This is a unique opportunity to be part of something from the very beginning.
As a Linux Systems Engineer with a focus on HPC/AI, you will help build and operate an HPC cluster specialized for AI workloads. This role is ideal for someone with solid Linux systems administration experience who is excited to grow into the world of High-Performance Computing and AI infrastructure. You will contribute to bringing advanced AI solutions to life, using your technical skills to support scalable, reliable, and high-performance systems for cutting-edge research.
Reporting to Stephan Stadlbauer, Head of Scientific Computing, you will combine Linux systems engineering, hardware and infrastructure support, and close collaboration with multidisciplinary teams. This position focuses on helping design, implement, and operate infrastructure for innovative AI research. If you are passionate about Linux systems and on-premises infrastructure and want to develop further in HPC and AI, this role is an excellent opportunity.
- Deploy, rack, cable, configure, and maintain server hardware, GPU nodes, networking equipment, and storage systems in our on-premises data centers.
- Administer and harden a large-scale Linux environment (Debian/Ubuntu) that forms the backbone of the HPC/AI cluster.
- Assist in designing, building, and scaling our HPC cluster specifically optimized for AI workloads - learning HPC best practices along the way.
- Configure and manage the workload manager (SLURM) to efficiently schedule, monitor, and manage diverse jobs including AI training and inference.
- Implement and optimize high-performance storage solutions (e.g., BeeGFS, Lustre) tailored for large-scale AI/HPC datasets and model training.
- Install and configure key software components, including parallel file systems, networking fabrics, and AI-specific libraries and frameworks (e.g., TensorFlow, PyTorch).
- Troubleshoot and resolve complex technical issues related to hardware, software, and networking components during the cluster build and initial operation phases.
- Provide technical support and guidance to scientists for running their AI workloads on the cluster, including job submission, monitoring, and basic troubleshooting.
- Monitor system performance, resource utilization, and job efficiency to optimize throughput and infrastructure.
- Document system design, configurations, procedures, and best practices for building and operating the AI HPC cluster.
Requirements
- Education in Computer Science, Information Technology, or a related field (or equivalent practical experience).
- Solid, hands-on experience in Linux system administration (e.g., Ubuntu, Debian, RHEL) in professional or large-scale environments.
- Proficiency in scripting and automation (e.g., Bash, Python, Lua) for system management, deployment, and monitoring tasks.
- Practical experience with server hardware -- you are comfortable racking equipment, diagnosing hardware faults, and working in a data-center environment.
- Familiarity with configuration management and automation tools (e.g., Ansible, Puppet, Salt) and a strong desire to apply automation best practices at scale.
- Good understanding of networking fundamentals (TCP/IP, VLANs, firewalls, DNS/DHCP); experience with high-speed networking or InfiniBand is a plus.
- Interest in or initial exposure to HPC concepts (job schedulers, parallel file systems, cluster management) -- with a genuine eagerness to learn and develop deep expertise.
- Interest in or initial exposure to GPU-accelerated computing and AI workloads -- with a willingness to grow into this area.
- Excellent problem-solving skills and a proactive, hands-on attitude towards tackling complex technical challenges in a fast-paced environment.
- Ability to communicate effectively in English and collaborate with technical and research teams.
Desired skills
- Experience with HPC systems, cluster management tools, or job schedulers (SLURM, PBS).
- Experience with containers and orchestration (e.g., Docker, Apptainer, Kubernetes).
- Familiarity with parallel or network file systems (e.g., BeeGFS, Lustre, GPFS).
- Exposure to GPU management, CUDA toolkits, or AI frameworks (TensorFlow, PyTorch).
- Experience working with research scientists or in an academic environment.
- Familiarity with monitoring and observability stacks (Prometheus, Grafana, CheckMK).
Benefits & conditions
- A competitive salary (minimum gross annual salary of EUR 58,000)
- Support for your wellbeing, including access to a company doctor
- Fresh fruit, sweet treats, and free coffee and tea available every day
- Flexible working arrangements, with the option for one home office day per week
- Core hours: Monday-Thursday 09:00-15:00, Friday 09:00-13:00
- Meal allowance to make your day a little easier
- A welcoming community with diverse social and cultural activities
- Relocation support to help you settle in comfortably if you're moving to join us