Senior Performance Engineer - GPU clusters
Job description
You'll join a small, senior team that works between the hardware and Linux OS layers, solving performance problems that affect tens of thousands of GPUs. This is hands-on, high-impact engineering where microsecond gains matter and every optimization is felt at global scale.
The GPU & InfiniBand team is responsible for enhancing and optimizing the core components of the Cloud platform, with a specific focus on GPU computing, InfiniBand networks, and the KVM/QEMU stack. You'll work closely with hardware virtualization and device emulation technologies, ensuring high performance and security in multi-GPU, HPC environments.
The role involves analyzing, troubleshooting, and improving infrastructure to support new hardware, fine-tuning system performance, and automating fault detection and resolution in a complex system.
What you'll do
In this position, you will be responsible for:
- Tuning the performance of GPU clusters and InfiniBand networks to ensure optimal operation in HPC and GPU-based environments.
- Analyzing and troubleshooting the root cause of issues related to GPUs and InfiniBand networks, and proposing corrective actions.
- Integrating new hardware into the existing infrastructure, including support for new GPU hardware through software stacks like Kubernetes, QEMU, and KVM.
- Enhancing automation systems that proactively monitor, detect, and resolve issues in GPU and InfiniBand environments.
- Configuring and managing GPU devices and InfiniBand fabrics, ensuring efficient and reliable operation.
Requirements
- 5+ years of professional experience in system-level software development (focused on performance optimization, low-level programming).
- 3+ years of hands-on experience with Linux systems (administration, troubleshooting, and performance tuning).
- In-depth understanding of server architecture, including PCIe devices, NICs, Linux OS/Kernel, and high-performance computing (HPC) systems.
- Strong proficiency in one or more performance-oriented programming languages (C/C++, Go, Python).
It would be a plus (but not essential) if you have:
- Experience with GPU end-to-end testing in a cluster environment using InfiniBand networking.
- Proven track record of analyzing and optimizing the performance of HPC workloads (e.g., simulations, data analysis, AI/ML workloads).
- Familiarity with RDMA, RoCE, and InfiniBand protocols for high-performance communication.
- Background in Software-Defined Networking (SDN) and experience with HPC cluster networking.
- Understanding of QEMU/KVM virtualization and experience managing virtualized environments.
- Experience with deep learning frameworks such as PyTorch and TensorFlow, and their integration with HPC systems.
- Familiarity with collective communication libraries like MPI and NCCL for distributed computing.
This is for you if you
- Love solving deep technical challenges, care about performance down to the microsecond, and want to work on infrastructure that pushes the limits of what's possible.
- Get enthusiastic about the prospect of joining a massively scaling organization, and the chances this offers to take ownership and end-to-end responsibility.
Benefits & conditions
- Salary: up to 200k OTE.
- Flexible working arrangements.
- A dynamic and collaborative work environment that values initiative and innovation.
- Location: Amsterdam or remote.