Senior Software Engineer, LLM Performance

PARASAIL, LLC
Oakland, United States of America
6 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Oakland, United States of America

Tech stack

API
Artificial Intelligence
C++
Nvidia CUDA
General-Purpose Computing on Graphics Processing Units
Python
Open Source Technology
High Performance Computing
PyTorch
Large Language Models
Multi-Agent Systems
Kubernetes
Decoding

Job description

The Senior Software Engineer, LLM Performance plays a crucial role in delivering a competitive platform by focusing on efficiently scheduling, executing, and managing AI workloads on distributed compute systems. This role is deeply technical, spanning from low-level GPU kernels to distributed AI orchestration and Kubernetes (K8s) deployments. It is about more than optimization; it's about pioneering efficient infrastructure that supports AI's transformative role in reshaping productivity, revolutionizing industries, and addressing some of the world's most challenging problems. You'll ensure that generative AI - including large language models (LLMs), multi-modal models, and diffusion models - operates efficiently at enterprise scale while driving continuous improvements in cost, performance, and sustainability.

  • Add support for new LLMs, working across the stack from low-level GPU kernels to Kubernetes-based deployments.
  • Contribute to cutting-edge open-source LLM engines such as vLLM or SGLang to extend their capabilities and performance (e.g. use Python technologies to improve API servers or request schedulers).
  • Operate closer to the hardware, focusing on building and integrating solutions to boost performance and hardware utilization. For example, improve attention backends like FlashAttention or FlashInfer by contributing to their development and optimization, or by integrating their solutions into vLLM.
  • Improve LLM performance using advanced algorithmic solutions such as speculative decoding, quantization, or other state-of-the-art techniques. Understand the impact of such techniques on model quality.

Requirements

  • Expertise in GPU computing, including platforms and frameworks such as CUDA, ROCm, XLA, PyTorch, and JAX.
  • Background in performance analysis and optimization of AI/HPC workloads (e.g. profiling or theoretical analysis of FLOPs and memory bandwidth).
  • Experience writing GPU kernels using technologies such as CUDA, CUTLASS, or Triton.
  • Strength in Python and C++.
  • Demonstrated contributions to open-source projects. Contributions to inference engines such as vLLM are a strong plus.
  • A production-oriented mindset emphasizing robust, scalable code suitable for enterprise-grade applications.
  • A relentless curiosity about cutting-edge AI technologies combined with a passion for solving complex problems.

About the company

Parasail is redefining AI infrastructure by enabling seamless deployment across a distributed network of GPUs, optimizing for cost, performance, and flexibility. Our mission is to empower AI developers with a fast, cost-efficient, and scalable cloud experience, free from vendor lock-in and designed for the next generation of AI workloads.
