AI Inference Engineer

Triune Infomatics Inc

San Jose, United States of America

12 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Job location

San Jose, United States of America

Tech stack

Artificial Intelligence

Systems Engineering

C++

Computer Clusters

Profiling

Nvidia CUDA

Software Debugging

Distributed Systems

Fault Tolerance

Python

Node.js

OpenShift

Rust

Datadog

Network Routers

Graphics Processing Unit (GPU)

Load Balancing

System Availability

Large Language Models

Kubernetes

Hardware Acceleration

Front End Software Development

TensorRT

Hardware Infrastructure

API Design

Microservices

Job description

We are seeking a highly skilled AI Inference Engineer to join our team and drive the performance, scalability, and reliability of our large-scale model serving infrastructure. This role sits at the intersection of systems engineering, GPU optimization, and distributed infrastructure, and is ideal for someone who thrives on squeezing maximum performance out of production AI workloads.

Inference Serving & Optimization

· Build, operate, and optimize production model-serving stacks using frameworks such as vLLM, SGLang, Triton Inference Server, TensorRT-LLM, TorchServe, or KServe

· Develop and maintain custom high-throughput microservices for model inference using C++, Python, and Rust

GPU & Hardware Acceleration

· Write and optimize custom GPU kernels using CUDA, ROCm, or the Triton kernel language

· Apply deep understanding of GPU architecture, including memory hierarchies and tensor cores, to improve compute efficiency

LLM Inference Internals

· Optimize prefill and decode stages, attention mechanisms, and continuous batching

· Implement and tune quantization, speculative decoding, tensor parallelism, pipeline parallelism, and Mixture of Experts (MoE) serving strategies

Memory & KV Cache Management

· Design and implement KV cache optimization strategies, including PagedAttention, chunked prefill, prefix caching, and quantized KV

· Develop cache transfer and offload strategies to manage memory pressure under high-volume, irregular workloads

Distributed Systems & Infrastructure

· Build and operate fault-tolerant, high-concurrency serving systems deployed on Kubernetes, OpenShift, or similar orchestration platforms, using Helm or comparable deployment tooling

· Implement tensor parallelism, pipeline parallelism, and distributed computing across multi-node, multi-GPU clusters

Distributed Serving Platform (Dynamo)

· Contribute to distributed serving architecture components including frontend, router, worker discovery, multi-model routing, and health checks

· Build and maintain OpenAI-compatible endpoints across multiple backends, including SGLang, TensorRT-LLM, and vLLM

Performance & Reliability

· Conduct deep profiling and benchmarking to identify and resolve latency and throughput regressions

· Build telemetry-driven observability platforms ensuring high availability, load balancing, and dynamic request scheduling

Model Support

· Bring up and support a broad range of model classes in production, including decoder-only LLMs, MoE models, hybrid attention/SSM models, multimodal models, embedding models, reward models, and classification models

Requirements

The ideal candidate has hands-on experience building or operating production-grade inference serving systems and is comfortable working close to the hardware, from CUDA/ROCm kernels to distributed multi-node, multi-GPU clusters serving large language models at scale.

· Proven experience with production model-serving frameworks (vLLM, SGLang, Triton Inference Server, TensorRT-LLM, TorchServe, KServe, or custom runtimes)

· Strong proficiency in C++, Python, and Rust for building high-performance, memory-efficient systems

· Hands-on experience writing GPU kernels using CUDA and/or ROCm

· Solid understanding of LLM inference internals, including attention mechanisms, KV cache management, continuous batching, and quantization

· Experience with distributed, multi-node, multi-GPU serving environments

· Experience deploying and managing services on Kubernetes, OpenShift, or similar orchestration platforms

· Strong background in performance profiling, benchmarking, and debugging latency or throughput issues

· Direct experience working with NVIDIA Dynamo or similar distributed serving architectures (router, worker discovery, multi-model routing)

· Experience supporting diverse model types in production, including MoE, multimodal, and hybrid attention/SSM architectures

· Familiarity with OpenAI-compatible API design and implementation

· Experience with telemetry and observability tooling for large-scale GPU infrastructure
