Machine Learning Engineer

Client Server

Charing Cross, United Kingdom

4 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Compensation

£110K

Job location

Remote

Charing Cross, United Kingdom

Tech stack

Software Debugging

Distributed Data Store

Distributed Systems

Python

Machine Learning

Open Source Technology

Performance Tuning

Software Engineering

TypeScript

Reinforcement Learning

PyTorch

Large Language Models

Prompt Engineering

Deep Learning

Kubernetes

Slurm

Job description

As a Machine Learning Engineer, you'll take open-source LLMs (code and general models) and turn them into high-performance software engineer agents using supervised fine-tuning and large-scale reinforcement learning. This isn't prompt engineering. You'll design and run serious training experiments across multi-node GPU clusters, build RL loops where models write code and get rewarded (or penalised) by real test outcomes, and push long-context and MoE-style architectures to their limits.

You'll work hands-on across the full stack: custom PyTorch dataloaders, distributed training (DDP/FSDP), experiment tracking, debugging NCCL issues at 2am, and squeezing performance out of multi-GPU jobs. You'll help design opinionated reward functions that reflect what great engineering actually looks like, not just benchmark scores.

You'll extend benchmark suites, test models on real-world repositories, analyse failure modes and feed insights back into data and training strategy. Collaborating with infrastructure, product and research teams, you'll contribute to decisions about what to train next and how to measure results.

Location / WFH:

You'll be based in the dog-friendly London office on a full-time basis, with daily catered lunch and working hours of 0900-1700 (with no expectation to do more).

Requirements

You have strong experience with training deep learning models in production
You have in-depth knowledge of PyTorch, including hands-on experience with torch.distributed (DDP/FSDP-style training, distributed data loading, gradient scaling, etc.)
You have experience of training large sequence models or LLMs
You have a software engineering background with Python and are also familiar with TypeScript and/or Golang
You have distributed systems / training ops experience, including practical experience running multi-node jobs on GPU clusters (Slurm, Kubernetes or managed cloud equivalents), and are familiar with GPU performance tuning: memory usage, mixed precision and throughput vs. latency tradeoffs
You're collaborative with great communication skills