GPU Cluster Architect - Data Center

Hamilton Barnes

Amsterdam, Netherlands

8 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Remote

Amsterdam, Netherlands

Tech stack

Artificial Intelligence

Data Centers

Ethernet

InfiniBand

Python

Systems Architecture

Scripting (Bash/Python/Go/Ruby)

Graphics Processing Unit (GPU)

Large Language Models

Reliability of Systems

Kubernetes

Low Latency

Slurm

Job description

We're looking for a GPU Cluster Architect to lead the design and development of our client's next-generation AI infrastructure powering large-scale, GPU-accelerated workloads. In this hands-on role, you'll own architectural decisions across compute, networking, and storage, building platforms capable of supporting the scale, performance, and reliability demands of modern AI and ML systems.

You'll define how tens of thousands of GPUs are interconnected, powered, cooled, and optimized across multiple data center sites. Working alongside world-class engineering teams, you'll shape the backbone of one of the most advanced AI clouds in the world.

If you're passionate about designing ultra-scale systems, optimizing performance for LLM training and inference, and building the core infrastructure that powers AI innovation, this is your opportunity.

Responsibilities

Architect scalable GPU cluster topologies spanning compute nodes, interconnects (InfiniBand, Ethernet), storage, and control planes
Model and analyze AI/ML workloads (LLM training, inference) to drive tradeoffs in latency, bandwidth, GPU density, and performance
Collaborate with network architects to design and validate low-latency, high-throughput interconnects (InfiniBand HDR/NDR, RoCEv2) at POD and data center scale
Integrate and optimize storage solutions to support training datasets, checkpointing, and high-performance I/O operations
Design for reliability, incorporating telemetry, automation, and monitoring to detect and resolve issues early
Partner with cross-functional teams including SRE, networking, storage, and data center engineering to operationalize your designs

Requirements

5+ years of experience designing GPU or HPC clusters at scale
Deep understanding of modern GPU architectures (NVIDIA, AMD)
Expertise with HPC interconnects (InfiniBand, RoCE) and low-latency networking
Strong background in systems architecture, compute, and hardware reliability
Proficiency in scripting and automation (Python, Go)

Bonus

Experience with AI/ML workload optimization and performance modeling
Familiarity with large-scale data center design and cooling/power strategies
Exposure to orchestration systems (Kubernetes, Slurm) or telemetry frameworks

Benefits & conditions

Bonus scheme
Company shares
Flexible remote working

Salary

Up to €200,000 gross per year

About the company

We are partnered with a fast-growing global technology organisation specialising in full-stack cloud infrastructure designed for the artificial intelligence era. Headquartered in Amsterdam and listed on Nasdaq, it builds and operates cutting-edge AI cloud platforms and large-scale GPU-powered data centres that enable developers, researchers and enterprises to train, deploy and scale AI workloads with unmatched performance and reliability. With a presence across Europe, North America and Israel, the business combines deep technical expertise with a mission to democratise access to advanced AI infrastructure, supporting innovation across sectors from life sciences to media and beyond.