Machine Learning Engineer, LLM Inference Optimization

GMI Cloud

Piedmont, United States of America

yesterday

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Intermediate

Job location

Piedmont, United States of America

Tech stack

Cloud Computing

Profiling

Software Debugging

Memory Management

Generalized Linear Model

Machine Learning

Open Source Technology

Performance Tuning

Graphics Processing Unit (GPU)

Autoscaling

Large Language Models

Low Latency

Free and Open-Source Software

Machine Learning Operations

TensorRT

Decoding

Job description

You will focus on B200-first optimization, with support for H200 evolution, across core domains including quantization, speculative decoding, KV cache and memory management, prefill/decode disaggregation, and system-level inference optimization. You will work closely with platform and infrastructure teams to transform cutting-edge ideas into measurable gains in latency, throughput, cost efficiency, and production scalability.

Drive frontier research and engineering in LLM inference optimization across one of the four focus tracks (Speculative Decoding, Quantization, PD Disaggregation, KV Cache & Memory) while contributing across the full optimization stack.
Develop next-generation optimization strategies for large-scale LLM serving across model execution, runtime systems, and production inference platforms - with B200 as the primary target and H200 as a continuing platform.
Advance state-of-the-art techniques in quantization (NVFP4 / MXFP4 / FP8, QAT), speculative decoding (EAGLE-3, MTP, DFlash, ModelOpt, SpecForge), KV cache & memory management (LMCache / HiCache / NV KVBM, paged attention, prefix-aware routing), and PD disaggregation (NVIDIA Dynamo, KV-aware router/planner, fault recovery).
Drive system-level optimization across scheduling, batching, routing, gateway orchestration, adapter serving, and end-to-end inference efficiency.
Build scalable optimization frameworks, performance methodologies, and benchmark infrastructure that allow GMI to stay ahead of the industry as models, hardware, and serving patterns evolve.
Productionize cutting-edge ideas into real customer workloads - measured by TTFT, ITL, throughput, goodput, tail latency, quality, and unit token cost.
Engage with and contribute to the open-source community (vLLM, SGLang, TensorRT-LLM, NVIDIA Dynamo / ModelOpt, FlashInfer, LMCache, etc.) - read upstream code, file issues, send PRs, and publish tech blogs and case studies.
Collaborate closely with platform, infrastructure, and product teams to make inference optimization a core technical advantage of GMI Cloud.

Requirements

Strong hands-on experience with LLM inference systems and performance optimization on modern GPUs.
Solid understanding of inference metrics and tradeoffs, including TTFT, ITL, throughput, goodput, tail latency, GPU utilization, memory efficiency, and quality/cost tradeoffs.
Experience with one or more modern serving stacks such as SGLang, vLLM, TensorRT-LLM, NVIDIA Dynamo, or Triton.
Deep familiarity with GPU-based inference, model serving architecture, and production bottlenecks around compute, memory bandwidth, KV-cache behavior, and scheduling.
Demonstrable depth in at least one of the four focus areas: speculative decoding, quantization & precision, PD disaggregation, or KV cache & memory management.
Strong experimentation skills: able to design benchmarks, interpret results, debug regressions, and produce actionable conclusions rather than isolated microbenchmark wins.
Proficient with Claude Code at an advanced level - fluent with sub-agents, MCP servers, hooks, custom slash commands, and skills - with practical experience leveraging them for rapid iteration, profiling, observability, and performance debugging.
Clear communication - able to explain technical tradeoffs to engineers and cross-functional stakeholders, and willing to publish results externally.

2+ years of hands-on experience in LLM inference optimization or ML systems optimization, or a PhD in a related area.
Track record of large-scale model serving optimization (latency reduction, throughput improvement, memory efficiency, cost-performance tuning) in production.
Specific track depth in one or more of:
Speculative Decoding: EAGLE-3 / MTP / DFlash / Medusa / SpecForge / ModelOpt; experience training and shipping draft models for production.
Quantization & Precision: NVFP4 / MXFP4 / FP8 / INT4-AWQ / GPTQ; QAT pipelines on Blackwell or Hopper; rigorous accuracy benchmarking.
PD Disaggregation: NVIDIA Dynamo, KV-aware router/planner, large MoE serving (DeepSeek-V3/V4, Kimi, GLM, MiniMax), fault recovery, autoscaling.
KV Cache & Memory: LMCache / HiCache / NV KVBM, paged attention internals, prefix-aware routing, long-context and agentic workloads.
Familiarity with FlashInfer, Blackwell MLA, FA4, TRT-LLM MLA, or NSA is a strong plus.
Open-source contributions to vLLM, SGLang, TensorRT-LLM, NVIDIA Dynamo / ModelOpt, FlashInfer, LMCache, or related projects.
Experience publishing technical blogs, case studies, or papers on inference optimization.

About the company

GMI Cloud is a fast-growing AI infrastructure company backed by Headline VC and one of only seven cloud providers worldwide to earn NVIDIA's prestigious Reference Platform Cloud Partner designation. We operate 8 of our own GPU clusters across the U.S. and Asia, delivering a full spectrum of services from GPU compute to AI model inference API solutions. As an NVIDIA Reference Platform Cloud Partner, our infrastructure meets the highest standards for performance, security, and scalability in AI deployments. We empower AI startups and enterprises to "build AI without limits," providing everything they need to prototype, train, and deploy AI models quickly and reliably.
