Machine Learning Engineer

Client Server
Charing Cross, United Kingdom

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Compensation
£ 110K

Job location

Remote
Charing Cross, United Kingdom

Tech stack

Software Debugging
Distributed Data Store
Distributed Systems
Python
Machine Learning
Open Source Technology
Performance Tuning
Software Engineering
TypeScript
Reinforcement Learning
PyTorch
Large Language Models
Prompt Engineering
Deep Learning
Kubernetes
Slurm
Go

Job description

As a Machine Learning Engineer, you'll take open-source LLMs (code and general models) and turn them into high-performance software engineering agents using supervised fine-tuning and large-scale reinforcement learning. This isn't prompt engineering. You'll design and run serious training experiments across multi-node GPU clusters, build RL loops where models write code and get rewarded (or penalised) by real test outcomes, and push long-context and MoE-style architectures to their limits.

You'll work hands-on across the full stack: custom PyTorch dataloaders, distributed training (DDP/FSDP), experiment tracking, debugging NCCL issues at 2am, and squeezing performance out of multi-GPU jobs. You'll help design opinionated reward functions that reflect what great engineering actually looks like, not just benchmark scores.

You'll extend benchmark suites, test models on real-world repositories, analyse failure modes and feed insights back into data and training strategy. Collaborating with infrastructure, product and research teams, you'll contribute to decisions about what to train next and how to measure results.

Location / WFH:

You'll be based full-time in the dog-friendly London office, with daily catered lunch and working hours of 0900-1700 (with no expectation to do more).

Requirements

  • You have strong experience with training deep learning models in production
  • You have in-depth knowledge of PyTorch, including hands-on experience with torch.distributed (DDP/FSDP-style training, distributed data loading, gradient scaling, etc.)
  • You have experience of training large sequence models or LLMs
  • You have a software engineering background with Python and are also familiar with TypeScript and/or Golang
  • You have distributed systems / training ops experience, including practical experience running multi-node jobs on GPU clusters (Slurm, Kubernetes, or managed cloud equivalents), and are familiar with GPU performance tuning: memory usage, mixed precision, throughput vs. latency tradeoffs
  • You're collaborative, with great communication skills

Benefits & conditions

  • Salary to £110k
  • Equity / stock options
  • 30 days holiday (+ Bank Holidays)
  • Daily lunch, monthly breakfasts
  • Dog-friendly office
  • Pension
  • Monthly socials
  • Impactful role that you can shape and influence

About the company

Managed by: Data Team
