Researcher in AI Computing Systems

Eu Recruit
2 days ago

Role details

Contract type
Permanent contract
Employment type
Part-time (≤ 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Tech stack

Artificial Intelligence
Nvidia CUDA
Linux kernel
Linux System Administration
Large Language Models
Machine Learning Operations

Job description

A research-driven technology organization is seeking a Senior Researcher in AI Computing Systems to advance the efficiency of large language model (LLM) inference and retrieval-augmented generation (RAG) pipelines.

This role operates at the intersection of systems research and low-level performance engineering, focusing on optimizing attention mechanisms, KV-cache strategies, and end-to-end inference stacks. The position involves translating cutting-edge research into high-performance, production-ready implementations.

LLM Inference Optimization

  • Design and implement techniques to reduce inference latency and improve throughput, including:
      • KV-cache precomputation
      • Cache reuse and blending strategies
      • Efficient batching and scheduling
  • Optimize time-to-first-token (TTFT) and overall system efficiency.

KV-Cache Systems & Memory Optimization

  • Develop and integrate KV-cache reuse and blending pipelines into inference systems.
  • Design caching policies, including:
      • Paging and eviction strategies
      • Memory layout optimization
      • Trade-offs between accuracy and performance
  • Ensure correctness and stability under high-throughput workloads.

Attention Mechanism Optimization

  • Implement and optimize sparse and selective attention techniques.
  • Develop efficient masking strategies and block-level computation methods.
  • Work closely with attention kernels to maximize hardware utilization.

Low-Level Performance Engineering

  • Profile and optimize model execution using modern attention backends and kernel frameworks.
  • Work with:
      • PyTorch internals
      • High-performance attention kernels (e.g., FlashAttention-style implementations)
  • Identify and resolve performance bottlenecks across compute and memory subsystems.

Research Translation & Innovation

  • Stay current with advances in LLM inference, caching systems, and RAG architectures.
  • Translate research ideas into robust, scalable implementations.
  • Contribute to internal innovation and potentially to external publications or open-source projects.

Requirements

  • PhD in Computer Science, Electrical Engineering, or a related field.
  • Strong software engineering skills in Python, with deep experience in PyTorch.
  • Solid understanding of transformer inference, including:
      • Prefill vs. decode stages
      • KV-cache structure and memory layout
      • Masking and batching strategies
      • Latency vs. throughput trade-offs
  • Experience with benchmarking and profiling large-scale LLM workloads.
  • Ability to diagnose and resolve performance bottlenecks.
  • Strong communication skills and ability to collaborate across research and engineering teams.

Preferred Qualifications

  • Experience with modern LLM inference frameworks (e.g., vLLM or similar systems).
  • Familiarity with attention kernel development and optimization:
      • CUDA, Triton, or custom kernel implementations
  • Experience building or optimizing RAG pipelines, including:
      • Retrieval and indexing
      • Chunking and reranking
      • Interaction between retrieval and inference latency
  • Contributions to open-source projects or publications in AI systems or ML infrastructure.
  • Systems-level expertise, including:
      • Linux environments
      • Memory hierarchy and storage systems
      • Performance engineering close to hardware

Personal Attributes

  • Strong systems-thinking mindset with attention to performance and scalability.
  • Ability to bridge research concepts and production engineering.
  • Detail-oriented with a focus on measurable performance improvements.
  • Collaborative approach in multidisciplinary environments.
  • Curiosity and drive to explore emerging AI infrastructure techniques.
