Machine Learning Inference Engineer
Job description
We are partnering with a fast-growing AI startup building next-generation multimodal generative systems focused on highly realistic visual experiences at scale. The company operates at the intersection of computer vision, generative AI, and real-time inference infrastructure, developing advanced AI products used by enterprise customers across large consumer-facing industries.
This is a highly technical and hands-on engineering role focused on production inference optimization for multimodal and generative AI systems. The ideal candidate will have deep expertise in GPU inference, model serving, PyTorch-based deployment, and performance optimization for large-scale AI applications.
The role offers significant ownership across infrastructure, inference systems, and production model optimization, with opportunities to contribute to novel AI system design and scalable deployment architectures.
What You'll Work On
- Build and optimize high-performance inference-serving systems for multimodal and generative AI models
- Improve latency, throughput, scalability, and GPU utilization for production AI workloads
- Productionize large PyTorch-based models for real-world deployment environments
- Design and maintain model-serving microservices and distributed inference infrastructure
- Optimize inference pipelines using:
  - TensorRT
  - Triton Inference Server
  - vLLM
  - CUDA/GPU acceleration techniques
- Work on:
  - KV cache optimization
  - model pruning
  - quantization
  - distillation
  - batching strategies
  - memory optimization
  - latent-space conditioning
- Deploy and scale multimodal architectures including:
  - diffusion models
  - vision-language models (VLMs)
  - large vision pipelines
- Collaborate closely with research and product engineering teams to balance:
  - model quality
  - latency
  - infrastructure cost
  - production reliability
- Own the full inference optimization lifecycle from experimentation to production deployment
Ideal Background
- Strong experience building and optimizing AI inference systems in production
- Deep understanding of GPU architecture and performance optimization
- Hands-on expertise with:
  - Python
  - PyTorch
  - CUDA
  - TensorRT
  - Triton
  - vLLM
Requirements
- Experience with multimodal AI, computer vision, or generative AI systems
- Familiarity with diffusion models or large-scale vision pipelines is strongly preferred
- Strong understanding of model deployment tradeoffs:
  - throughput vs latency
  - memory efficiency
  - model quality vs compute cost
- Experience working with distributed inference systems and scalable serving infrastructure
- Comfortable operating in highly autonomous, fast-moving startup environments
Additional Skills and Experience
- diffusion model optimization
- multimodal transformers
- quantization techniques
- FlashAttention
- TensorRT-LLM
- speculative decoding
- model parallelism
- Kubernetes-based ML infrastructure
- Contributions to open source AI infrastructure projects
- Publications, patents, or research experience in AI systems, vision, or generative modeling
Why This Opportunity
- Work on cutting-edge multimodal and generative AI systems deployed at scale
- Significant ownership and autonomy across core AI infrastructure
- Opportunity to solve complex GPU inference and scaling challenges
- High-impact engineering role with direct visibility into product performance
- Fast-moving environment with strong technical talent density
- Opportunity to contribute to novel IP and patentable systems
Benefits & conditions
- Healthcare 80% covered
- 401(k) with 3% matching
- $500 learning stipend
- Global program: work from anywhere in the world for 3 months