Performant Architecture for a Fast Gen AI User Experience
Stop blaming slow models for your AI app's latency. Your architecture is the real problem.
#1 · about 2 minutes
Building a real-time translator inspired by sci-fi
The Babel fish from "Hitchhiker's Guide to the Galaxy" serves as the inspiration for a real-time audio translation project.
#2 · about 4 minutes
Analyzing the latency of a basic AI architecture
A demonstration of the initial 2019 architecture, built on Google Cloud, reveals a latency of over ten seconds for a simple translation.
#3 · about 2 minutes
Reducing latency by upgrading the AI service stack
Switching to modern, specialized APIs like Deepgram and ElevenLabs cuts the total processing time from twelve seconds to five.
#4 · about 2 minutes
Implementing streaming to reduce response wait times
Adopting a streaming approach provides a major performance boost, but a naive implementation results in chaotic and low-quality audio output.
#5 · about 2 minutes
Using chunking to balance streaming speed and quality
Chunking data based on sentence punctuation controls the streaming waterfall, improving the quality of generated audio without sacrificing speed.
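As a rough illustration of sentence-based chunking, the sketch below buffers streamed text fragments and releases one complete sentence at a time, so a downstream text-to-speech step would always receive a full sentence. The function name `chunk_by_sentence` and the choice of `.`, `!` and `?` as boundaries are assumptions made for this example, not details taken from the talk.

```python
import re
from typing import Iterable, Iterator

# Assumption: a sentence ends at '.', '!' or '?'. Real text needs more care
# (decimals such as "3.14", abbreviations, ellipses).
SENTENCE_END = re.compile(r"[.!?]")


def chunk_by_sentence(fragments: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences from a stream of arbitrary text fragments."""
    buffer = ""
    for fragment in fragments:
        buffer += fragment
        # Emit every complete sentence currently sitting in the buffer.
        while True:
            match = SENTENCE_END.search(buffer)
            if match is None:
                break
            end = match.end()
            sentence = buffer[:end].strip()
            buffer = buffer[end:]
            if sentence:
                yield sentence
    # Flush whatever is left when the stream ends without punctuation.
    tail = buffer.strip()
    if tail:
        yield tail


if __name__ == "__main__":
    stream = ["Hel", "lo wor", "ld. How ", "are you", "? Fine"]
    for sentence in chunk_by_sentence(stream):
        print(sentence)
```

Because the function is a generator, the first sentence is available as soon as its closing punctuation arrives, rather than after the whole response has been produced.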
#6 · about 6 minutes
Eliminating network latency with local and edge models
Running a smaller, local AI model like Whisper on the edge eliminates cross-continental network latency and provides near-instantaneous results.
#7 · about 3 minutes
Using caching to serve pre-generated AI responses
Implementing caching, from simple request matching to semantic search with vector databases, avoids redundant generation and speeds up common queries.
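The two caching levels mentioned here could be sketched as follows: an exact-match lookup on a normalized request, with a fallback to a similarity search over stored embeddings. This is a minimal in-memory sketch under stated assumptions; `ResponseCache`, the `embed` callback, and the `0.9` threshold are hypothetical names and values, and a plain list stands in for the vector database.

```python
import hashlib
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ResponseCache:
    """Exact-match cache with a semantic fallback (illustrative only)."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.9):
        self._embed = embed          # assumption: caller supplies an embedding function
        self._threshold = threshold  # assumption: tune per application
        self._exact: Dict[str, str] = {}
        self._semantic: List[Tuple[Vector, str]] = []  # stand-in for a vector database

    @staticmethod
    def _key(text: str) -> str:
        # Normalize case and whitespace so trivially different requests match.
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def put(self, text: str, response: str) -> None:
        self._exact[self._key(text)] = response
        self._semantic.append((self._embed(text), response))

    def get(self, text: str) -> Optional[str]:
        # 1. Cheap path: exact match on the normalized request.
        hit = self._exact.get(self._key(text))
        if hit is not None:
            return hit
        # 2. Fallback: nearest stored request by cosine similarity.
        query = self._embed(text)
        best_score, best_response = 0.0, None
        for vector, response in self._semantic:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self._threshold else None
```

On a cache miss (`get` returns `None`), the application would call the model as usual and store the result with `put`. The linear scan is only reasonable for small caches; a real vector index would replace it at scale.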
#8 · about 2 minutes
Optimizing prompts and user experience for speed
Fine-tuning performance involves optimizing prompts to generate fewer tokens and improving perceived speed with clear loading states for the user.
#9 · about 2 minutes
Summary of key performance optimization techniques
A final recap covers the essential strategies for building fast Gen AI experiences, including streaming, edge computing, caching, and prompt optimization.