Senior Researcher in AI Computing Systems
Job description
A research-driven technology organization is seeking a Senior Researcher in AI Computing Systems to advance the efficiency of large language model (LLM) inference and retrieval-augmented generation (RAG) pipelines.
This role operates at the intersection of systems research and low-level performance engineering, focusing on optimizing attention mechanisms, KV-cache strategies, and end-to-end inference stacks. The position involves translating cutting-edge research into high-performance, production-ready implementations.
LLM Inference Optimization
- Design and implement techniques to reduce inference latency and improve throughput, including:
  - KV-cache precomputation
  - Cache reuse and blending strategies
  - Efficient batching and scheduling
- Optimize time-to-first-token (TTFT) and overall system efficiency.
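To give a concrete flavor of this area, the toy PyTorch sketch below precomputes a KV cache during prefill, serves the first token from it, and then extends the cache one token at a time during decode; this prefill/decode split is the basic mechanism behind TTFT optimization. All weights and names are illustrative, not from any production system.

```python
# Toy, single-head sketch of prefill/decode KV-cache reuse (illustrative only).
import time
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model = 64
wq = torch.randn(d_model, d_model) / d_model**0.5  # toy projection weights
wk = torch.randn(d_model, d_model) / d_model**0.5
wv = torch.randn(d_model, d_model) / d_model**0.5

def attend(q, k_cache, v_cache):
    # q: [1, d]; caches: [t, d]; attention over all cached positions
    scores = (q @ k_cache.T) / d_model**0.5
    return F.softmax(scores, dim=-1) @ v_cache

# Prefill: run the whole prompt once and materialize the KV cache.
prompt = torch.randn(128, d_model)           # 128 prompt "tokens"
t0 = time.perf_counter()
k_cache, v_cache = prompt @ wk, prompt @ wv  # precomputed KV cache
first = attend(prompt[-1:] @ wq, k_cache, v_cache)
ttft = time.perf_counter() - t0              # crude time-to-first-token proxy

# Decode: each step appends one K/V row instead of recomputing the prompt.
x = first
for _ in range(16):
    k_cache = torch.cat([k_cache, x @ wk], dim=0)
    v_cache = torch.cat([v_cache, x @ wv], dim=0)
    x = attend(x @ wq, k_cache, v_cache)

print(f"TTFT proxy: {ttft * 1e3:.2f} ms, cache length: {k_cache.shape[0]}")
```

Real inference stacks apply the same split per head and per layer, with batching and paged cache storage on top.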
KV-Cache Systems & Memory Optimization
- Develop and integrate KV-cache reuse and blending pipelines into inference systems.
- Design caching policies, including:
  - Paging and eviction strategies
  - Memory layout optimization
  - Trade-offs between accuracy and performance
- Ensure correctness and stability under high-throughput workloads.
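As a rough illustration of the caching-policy work described above, here is a toy paged KV cache with LRU eviction. The page size, pool capacity, and eviction policy are assumptions made for the sketch, and evicted pages are simply dropped; production allocators (paged-attention-style block tables, copy-on-write sharing) are considerably more involved.

```python
# Toy paged KV cache with LRU eviction (illustrative assumptions throughout).
from collections import OrderedDict
import torch

class PagedKVCache:
    def __init__(self, num_pages: int, page_size: int, d_model: int):
        self.page_size = page_size
        # Fixed pool of pages; each slot holds K and V for page_size tokens.
        self.pool = torch.zeros(num_pages, page_size, 2, d_model)
        self.free = list(range(num_pages))
        self.lru = OrderedDict()  # (seq_id, page_idx) -> pool slot

    def _slot_for(self, seq_id: int, page_idx: int) -> int:
        key = (seq_id, page_idx)
        if key in self.lru:
            self.lru.move_to_end(key)        # mark page most recently used
            return self.lru[key]
        if not self.free:                    # pool full: evict the LRU page
            _, slot = self.lru.popitem(last=False)
            self.free.append(slot)
        slot = self.free.pop()
        self.lru[key] = slot
        return slot

    def write(self, seq_id: int, pos: int, kv: torch.Tensor):
        # kv: [2, d_model], the stacked key and value for one token position
        slot = self._slot_for(seq_id, pos // self.page_size)
        self.pool[slot, pos % self.page_size] = kv

cache = PagedKVCache(num_pages=4, page_size=16, d_model=64)
cache.write(seq_id=0, pos=0, kv=torch.randn(2, 64))
```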
Attention Mechanism Optimization
- Implement and optimize sparse and selective attention techniques.
- Develop efficient masking strategies and block-level computation methods.
- Work closely with attention kernels to maximize hardware utilization.
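For illustration, one simple form of block-level selective attention is a sliding window over whole key blocks. The sketch below builds such a boolean mask and passes it to PyTorch's scaled_dot_product_attention; the block size and window width are arbitrary example values.

```python
# Block-level sliding-window attention mask fed to PyTorch SDPA (sketch).
import torch
import torch.nn.functional as F

def block_sliding_window_mask(seq_len: int, block: int, window_blocks: int):
    # True = attend. Query block i may see key blocks i-window_blocks+1 .. i,
    # intersected with an ordinary causal mask.
    blk = torch.arange(seq_len) // block
    in_window = (blk[:, None] - blk[None, :]).clamp(min=0) < window_blocks
    causal = torch.arange(seq_len)[:, None] >= torch.arange(seq_len)[None, :]
    return in_window & causal

seq_len, heads, d_head = 256, 8, 64
q = torch.randn(1, heads, seq_len, d_head)
k = torch.randn(1, heads, seq_len, d_head)
v = torch.randn(1, heads, seq_len, d_head)

mask = block_sliding_window_mask(seq_len, block=32, window_blocks=2)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)  # bool: True = keep
print(out.shape)  # torch.Size([1, 8, 256, 64])
```

A mask like this only saves memory traffic when the kernel actually skips masked blocks, which is exactly what block-level kernel work targets.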
Low-Level Performance Engineering
- Profile and optimize model execution using modern attention backends and kernel frameworks.
- Work with:
  - PyTorch internals
  - High-performance attention kernels (e.g., FlashAttention-style implementations)
- Identify and resolve performance bottlenecks across compute and memory subsystems.
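A minimal example of that profiling workflow, using torch.profiler around a scaled_dot_product_attention hot loop. This runs CPU-only as written; on a GPU machine one would also pass ProfilerActivity.CUDA.

```python
# Profile an attention-heavy region to surface compute/memory hotspots.
import torch
import torch.nn.functional as F
from torch.profiler import ProfilerActivity, profile, record_function

q = torch.randn(1, 8, 1024, 64)
k = torch.randn(1, 8, 1024, 64)
v = torch.randn(1, 8, 1024, 64)

with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
    with record_function("attention_loop"):
        for _ in range(10):
            F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Sort by self CPU time to see which kernels dominate.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```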
Research Translation & Innovation
- Stay current with advances in LLM inference, caching systems, and RAG architectures.
- Translate research ideas into robust, scalable implementations.
- Contribute to internal innovation and potentially to external publications or open-source projects.
Requirements
- PhD in Computer Science, Electrical Engineering, or a related field.
- Strong software engineering skills in Python, with deep experience in PyTorch.
- Solid understanding of transformer inference (see the sketch after this list), including:
  - Prefill vs. decode stages
  - KV-cache structure and memory layout
  - Masking and batching strategies
  - Latency vs. throughput trade-offs
- Experience with benchmarking and profiling large-scale LLM workloads.
- Ability to diagnose and resolve performance bottlenecks.
- Strong communication skills and ability to collaborate across research and engineering teams.
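As referenced in the list above, here is a small sketch of the standard KV-cache layout and of combining causal and padding masks for a batch of variable-length prompts. Shapes and lengths are arbitrary example values.

```python
# KV-cache layout plus a combined causal + padding mask (illustrative shapes).
import torch

batch, heads, max_len, d_head = 2, 8, 6, 64
lengths = torch.tensor([6, 3])  # real prompt length per sequence

# KV cache laid out as [batch, heads, seq, head_dim]; decode appends along seq.
k_cache = torch.zeros(batch, heads, max_len, d_head)
v_cache = torch.zeros(batch, heads, max_len, d_head)

# Padding mask: True where the key position holds a real token.
key_valid = torch.arange(max_len)[None, :] < lengths[:, None]        # [batch, seq]
# Causal mask: a query may not attend to future positions.
causal = torch.tril(torch.ones(max_len, max_len, dtype=torch.bool))  # [seq, seq]
# Broadcast to [batch, 1, q_len, k_len] for use as an attention mask.
attn_mask = key_valid[:, None, None, :] & causal[None, None, :, :]
print(attn_mask.shape)  # torch.Size([2, 1, 6, 6])
```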
Preferred Qualifications
- Experience working with modern LLM inference frameworks (e.g., vLLM or similar systems).
- Familiarity with attention kernel development and optimization (a Triton-style sketch follows at the end of this list):
  - CUDA, Triton, or custom kernel implementations
- Experience building or optimizing RAG pipelines (a toy retrieval sketch also follows), including:
  - Retrieval and indexing
  - Chunking and reranking
  - Interaction between retrieval and inference latency
- Contributions to open-source projects or publications in AI systems or ML infrastructure.
- Systems-level expertise, including:
  - Linux environments
  - Memory hierarchy and storage systems
  - Performance engineering close to hardware
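On the kernel-development point above: the canonical entry point to Triton is a blocked, masked elementwise kernel like the sketch below (generic tutorial-style code; requires a CUDA GPU). FlashAttention-style kernels build on the same program-id, tiling, and masked load/store primitives, applied to Q/K/V tiles with an online softmax.

```python
# Generic Triton starter kernel: blocked, masked elementwise add.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                       # one program per block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                       # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)                    # 1D launch grid
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
assert torch.allclose(add(x, y), x + y)
```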
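On the RAG point above: a toy end-to-end retrieval step, with fixed-size overlapping chunking and brute-force cosine similarity. The "embedder" here is a random byte-histogram projection purely so the sketch is self-contained; a real pipeline would use a trained embedding model, an ANN index, and a reranking stage.

```python
# Toy RAG retrieval: chunk, embed (stand-in), rank by cosine similarity.
import torch

def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

torch.manual_seed(0)
proj = torch.randn(256, 64)  # stand-in embedder: byte histogram -> 64-d vector

def embed(s: str) -> torch.Tensor:
    counts = torch.zeros(256)
    for b in s.encode("utf-8"):
        counts[b] += 1
    v = counts @ proj
    return v / v.norm()

corpus = "KV-cache reuse lowers prefill cost and improves TTFT. " * 20
chunks = chunk(corpus)
index = torch.stack([embed(c) for c in chunks])  # [num_chunks, 64]

query = embed("how does cache reuse reduce latency?")
scores = index @ query                           # cosine similarity (unit vectors)
top = torch.topk(scores, k=3).indices
print([chunks[i][:40] for i in top])
```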
Personal Attributes
- Strong systems-thinking mindset with attention to performance and scalability.
- Ability to bridge research concepts and production engineering.
- Detail-oriented with a focus on measurable performance improvements.
- Collaborative approach in multidisciplinary environments.
- Curiosity and drive to explore emerging AI infrastructure techniques.