Machine Learning Inference Engineer

Oscar Technology
San Francisco, United States of America
5 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English

Job location

San Francisco, United States of America

Tech stack

Artificial Intelligence
Computer Vision
Program Optimization
Nvidia CUDA
Python
Language Modeling
Performance Tuning
Graphics Processing Unit (GPU)
PyTorch
Generative AI
Low Latency
Machine Learning Operations
TensorRT
Microservices

Job description

We are partnering with a fast-growing AI startup building next-generation multimodal generative systems focused on highly realistic visual experiences at scale. The company operates at the intersection of computer vision, generative AI, and real-time inference infrastructure, developing advanced AI products used by enterprise customers across large consumer-facing industries.

This is a highly technical and hands-on engineering role focused on production inference optimization for multimodal and generative AI systems. The ideal candidate will have deep expertise in GPU inference, model serving, PyTorch-based deployment, and performance optimization for large-scale AI applications.

The role offers significant ownership across infrastructure, inference systems, and production model optimization, with opportunities to contribute to novel AI system design and scalable deployment architectures.

What You'll Work On

Build and optimize high-performance inference-serving systems for multimodal and generative AI models
Improve latency, throughput, scalability, and GPU utilization for production AI workloads
Productionize large PyTorch-based models for real-world deployment environments
Design and maintain model-serving microservices and distributed inference infrastructure
Optimize inference pipelines using TensorRT, Triton Inference Server, vLLM, and CUDA/GPU acceleration techniques
Work on KV cache optimization, model pruning, quantization, distillation, batching strategies, memory optimization, and latent-space conditioning
Deploy and scale multimodal architectures, including diffusion models, vision-language models (VLMs), and large vision pipelines
Collaborate closely with research and product engineering teams to balance model quality, latency, infrastructure cost, and production reliability
Own the full inference optimization lifecycle from experimentation to production deployment

Ideal Background

Strong experience building and optimizing AI inference systems in production
Deep understanding of GPU architecture and performance optimization
Hands-on expertise with Python, PyTorch, CUDA, TensorRT, Triton, and vLLM

Requirements

Experience with multimodal AI, computer vision, or generative AI systems
Familiarity with diffusion models or large-scale vision pipelines is strongly preferred
Strong understanding of model deployment tradeoffs: throughput vs. latency, memory efficiency, and model quality vs. compute cost
Experience working with distributed inference systems and scalable serving infrastructure
Comfortable operating in highly autonomous, fast-moving startup environments

Benefits & conditions

Diffusion model optimization
Multimodal transformers
Quantization techniques
FlashAttention
TensorRT-LLM
Speculative decoding
Model parallelism
Kubernetes-based ML infrastructure
Contributions to open source AI infrastructure projects
Publications, patents, or research experience in AI systems, vision, or generative modeling

Why This Opportunity

Work on cutting-edge multimodal and generative AI systems deployed at scale
Significant ownership and autonomy across core AI infrastructure
Opportunity to solve complex GPU inference and scaling challenges
High-impact engineering role with direct visibility into product performance
Fast-moving environment with strong technical talent density
Opportunity to contribute to novel IP and patentable systems

Healthcare 80% covered, 401(k) with 3% matching, $500 learning stipend, and a global program that lets you work from anywhere in the world for 3 months
