Systems Research Engineer
European Tech Recruit
Edinburgh, United Kingdom
Role details
Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Job location: Edinburgh, United Kingdom
Tech stack
Artificial Intelligence
Distributed Systems
Fault Tolerance
Systems Theories
Python
Rapid Prototyping Process
AI Infrastructure
Load Balancing
PyTorch
Large Language Models
TensorRT
Job description
- Distributed Systems R&D: Architecting components for CPU, GPU, and NPU clusters with a focus on modularity and extreme scalability.
- Performance Engineering: In-depth profiling of large-scale inference pipelines, specifically focusing on KV cache management and heterogeneous memory scheduling.
- AI Serving: Optimising high-throughput frameworks (vLLM, Ray Serve, PyTorch Distributed) to ensure low-latency, multi-tenant performance.
- Research Leadership: Contributing to top-tier venues (OSDI, NSDI, EuroSys, MLSys) and driving those innovations into real-world production.
Who You Are
We are looking for "systems-first" thinkers: engineers who understand what happens under the hood of a cluster.
Requirements
- Education: A Bachelor's or Master's in CS, EE, or a related field (PhD highly preferred).
- The Stack: Strong proficiency in C/C++ for systems work, with Python for rapid prototyping.
- Expertise: Hands-on experience with LLM serving frameworks (vLLM, Ray Serve, TensorRT-LLM) and distributed algorithms.
- Mindset: A solid grounding in systems research methodology and performance profiling tools.
The "Value Add" (Desired):
- A PhD focused on distributed computing or AI infrastructure.
- A track record of publications at major conferences (NeurIPS, ICML, ICLR, etc.).
- Deep knowledge of load balancing, fault tolerance, and resource orchestration in massive AI clusters.
About the company
One of the largest telecommunications companies in the world is looking for an experienced researcher to join its team in Edinburgh.
The Vision
We are currently scaling a world-class research team in Edinburgh to redefine the foundational software stack for the LLM era. As AI transitions from experimental workloads to "agentic" and "AI-native" infrastructure, we are building the super-node clusters and distributed architectures that will power the next generation of global data centres.
This is a unique hybrid role positioned at the intersection of academic-grade systems research and industrial-scale engineering. You won't just be writing papers; you'll be prototyping and deploying the frameworks that manage GPU/NPU clusters at massive scale.