Senior Software Engineer, Quantized Inference
Job description
We are now looking for a Senior Software Engineer for Quantized Inference! NVIDIA is seeking software engineers to accelerate the discovery and deployment of efficient inference recipes for LLMs. A recipe defines which operators are transformed into low-precision or sparsified variants, unlocking throughput and latency gains without regressing accuracy or verbosity. Recipes may incorporate techniques such as rotations, block scaling to attenuate outlier impact, or improved calibration data drawn from SFT/RL pipelines.
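To give a flavor of the block-scaling technique mentioned above: a minimal NumPy sketch of per-block symmetric int8 quantization, where each block carries its own scale so a single outlier only degrades its own block rather than the whole tensor. Function names and block size are illustrative, not any particular library's API.

```python
import numpy as np

np.random.seed(0)

def quantize_blockwise(x, block_size=32, bits=8):
    """Per-block symmetric integer quantization (illustrative sketch)."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for int8
    blocks = x.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)     # avoid divide-by-zero
    q = np.clip(np.round(blocks / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize_blockwise(q, scales):
    return q.astype(np.float32) * scales

x = np.random.randn(4, 32).astype(np.float32)
x[0, 0] = 50.0                                      # inject an outlier
q, scales = quantize_blockwise(x.reshape(-1))
x_hat = dequantize_blockwise(q, scales).reshape(4, 32)
```

With per-tensor scaling, the single outlier would inflate the scale for all 128 values; here only the outlier's own block pays the resolution cost, which is exactly why block scaling attenuates outlier impact.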
Each new recipe demands corresponding kernel and model-level implementations in inference engines (vLLM, TRT-LLM, SGLang). The candidate will translate recipe specifications into functionally correct, performant code, e.g., writing Triton kernels, inserting quantize/dequantize nodes into prefill and decode paths, and ensuring per-expert scaling in MoE layers is handled correctly. From there, the candidate will collaborate with partner inference teams to further optimize throughput and interactivity on target workloads. This work is a core component of our productization effort across Megatron-LM, ModelOpt, and vLLM.
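As a toy illustration of the kernel/model-level work described above: a weight-only int8 linear layer in plain NumPy, standing in for the quantize/dequantize pattern an inference engine would apply. The class and helper names are hypothetical; a real engine would dispatch to a fused low-precision GEMM kernel instead of dequantizing explicitly.

```python
import numpy as np

np.random.seed(0)

def quantize(w, qmax=127):
    """Per-tensor symmetric int8 quantization (illustrative sketch)."""
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

class QuantizedLinear:
    """Weight-only int8 linear: weights stored quantized, dequantized at
    matmul time. A real engine would call an int8 kernel here instead."""
    def __init__(self, w):
        self.q, self.scale = quantize(w)

    def __call__(self, x):
        return x @ (self.q.astype(np.float32) * self.scale)

w = np.random.randn(16, 8).astype(np.float32)
layer = QuantizedLinear(w)
x = np.random.randn(2, 16).astype(np.float32)
y = layer(x)        # quantized path
y_ref = x @ w       # full-precision reference
```

Verifying that `y` stays close to `y_ref` is the functional-correctness half of the job; making the quantized path faster than the reference is the performance half.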
What you'll be doing:
- Implement quantized and sparse recipes in inference engines (vLLM, TRT-LLM, SGLang)
- Own model export pipelines (ModelOpt, Megatron-LM <-> HuggingFace), ensuring quantized checkpoints serialize correctly for downstream serving
- Build prototypes and benchmarking harnesses to evaluate recipe throughput/interactivity before full optimization
- Develop data analysis tooling and visualizations for numerics debugging
- Improve developer productivity across the team: CI, build systems, training infrastructure, pipeline friction
- Participate in code reviews and incorporate feedback
What we need to see:
- Proficient in Python; familiarity with C++
- Strong software engineering fundamentals: concise, well-tested code; fluent with AI-assisted tooling
- Experience with ML accelerators and a basic understanding of how specific ML layers affect execution time
- Familiarity with PyTorch internals (custom ops, autograd, export) or an equivalent framework
- Experience reading, modifying, or contributing to a large open-source codebase
- MS/PhD in Computer Science or a related field, or equivalent experience
- 4+ years in a relevant software engineering role
- Demonstrated ability to move fast with ambiguous requirements, with strong written and verbal communication
Ways to stand out from the crowd:
- Experience contributing to inference serving frameworks (vLLM, TRT-LLM, SGLang) or Triton kernel development
- Track record of debugging numerical issues across mixed-precision boundaries
- Deep experience with model compression techniques: PTQ, QAT, structured/unstructured sparsity
Benefits & conditions
Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 152,000 USD - 241,500 USD for Level 3, and 184,000 USD - 287,500 USD for Level 4.