Researcher in AI Computing Systems

Eu Recruit
2 days ago

Role details

Contract type
Permanent contract
Employment type
Part-time (≤ 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Tech stack

Artificial Intelligence
Nvidia CUDA
Linux kernel
Linux System Administration
Large Language Models
Machine Learning Operations

Job description

A research-driven technology organization is seeking a Senior Researcher in AI Computing Systems to advance the efficiency of large language model (LLM) inference and retrieval-augmented generation (RAG) pipelines.

This role operates at the intersection of systems research and low-level performance engineering, focusing on optimizing attention mechanisms, KV-cache strategies, and end-to-end inference stacks. The position involves translating cutting-edge research into high-performance, production-ready implementations.

LLM Inference Optimization

  • Design and implement techniques to reduce inference latency and improve throughput, including:
      • KV-cache precomputation
      • Cache reuse and blending strategies
      • Efficient batching and scheduling
  • Optimize time-to-first-token (TTFT) and overall system efficiency.

KV-Cache Systems & Memory Optimization

  • Develop and integrate KV-cache reuse and blending pipelines into inference systems.
  • Design caching policies, including:
      • Paging and eviction strategies
      • Memory layout optimization
      • Trade-offs between accuracy and performance
  • Ensure correctness and stability under high-throughput workloads.

Attention Mechanism Optimization

  • Implement and optimize sparse and selective attention techniques.
  • Develop efficient masking strategies and block-level computation methods.
  • Work closely with attention kernels to maximize hardware utilization.

Low-Level Performance Engineering

  • Profile and optimize model execution using modern attention backends and kernel frameworks.
  • Work with:
      • PyTorch internals
      • High-performance attention kernels (e.g., FlashAttention-style implementations)
  • Identify and resolve performance bottlenecks across compute and memory subsystems.

Research Translation & Innovation

  • Stay current with advances in LLM inference, caching systems, and RAG architectures.
  • Translate research ideas into robust, scalable implementations.
  • Contribute to internal innovation and potentially to external publications or open-source projects.

Requirements

  • PhD in Computer Science, Electrical Engineering, or a related field.
  • Strong software engineering skills in Python, with deep experience in PyTorch.
  • Solid understanding of transformer inference, including:
      • Prefill vs. decode stages
      • KV-cache structure and memory layout
      • Masking and batching strategies
      • Latency vs. throughput trade-offs
  • Experience with benchmarking and profiling large-scale LLM workloads.
  • Ability to diagnose and resolve performance bottlenecks.
  • Strong communication skills and ability to collaborate across research and engineering teams.

Preferred Qualifications

  • Experience with modern LLM inference frameworks (e.g., vLLM or similar systems).
  • Familiarity with attention kernel development and optimization:
      • CUDA, Triton, or custom kernel implementations
  • Experience building or optimizing RAG pipelines, including:
      • Retrieval and indexing
      • Chunking and reranking
      • Interaction between retrieval and inference latency
  • Contributions to open-source projects or publications in AI systems or ML infrastructure.
  • Systems-level expertise, including:
      • Linux environments
      • Memory hierarchy and storage systems
      • Performance engineering close to hardware

Personal Attributes

  • Strong systems-thinking mindset with attention to performance and scalability.
  • Ability to bridge research concepts and production engineering.
  • Detail-oriented with a focus on measurable performance improvements.
  • Collaborative approach in multidisciplinary environments.
  • Curiosity and drive to explore emerging AI infrastructure techniques.
