Machine Learning Engineer
Role details
Job description
You'll work on the performance and efficiency of large-scale training workloads, helping to improve how advanced models are trained, scaled, optimised, and served in production. This role sits at the intersection of research and systems engineering, with a strong focus on distributed training, profiling, memory optimisation, and model efficiency. This is the chance to work on genuinely hard problems in foundation model development with meaningful real-world application.

What you'll be doing
- Profile end-to-end distributed training runs to identify bottlenecks across compute, GPU memory, and inter-GPU communication
- Improve the efficiency and reliability of large-scale training jobs, including contributing to architectural decisions and developing Triton/CUDA kernels where needed
- Design and implement model scaling, parallelisation, and memory optimisation techniques for very-large-context training workloads
- Partner closely with ML Researchers to diagnose inefficiencies, ensure new ideas scale effectively, and share best practice around model performance
- Support the productionisation and serving of models from the research side, including improving inference efficiency through techniques such as quantisation

Location: Barcelona, Spain - hybrid working (relocation support for Barcelona offered)
Compensation: Highly competitive salary + benefits + equity

What you bring
- Strong understanding of modern ML architectures and large-scale training pipelines
- Experience running distributed training jobs across multi-GPU systems
- Advanced profiling and debugging skills across CPU, GPU, memory, latency, and inter-GPU communication
Requirements
- Experience with model scaling and parallelisation approaches, including tensor and pipeline parallelism
- Familiarity with NCCL, MPI, and distributed communication primitives - highly desirable
- Knowledge of PyTorch and Triton internals - highly desirable
- Experience with C++ and CUDA - highly desirable