Lee Boonstra
Raise your voice!
#1 about 1 minute
Building a custom voice AI with WebRTC and Google APIs
An overview of the architecture for streaming voice from a browser to a backend for processing with conversational AI.
#2 about 4 minutes
Comparing custom voice AI to public assistants
A custom voice AI gives you more control over technical requirements and terms of service than public platforms such as Google Assistant or Alexa.
#3 about 1 minute
Handling short versus long user utterances
Public assistants are optimized for short commands, whereas custom AI for use cases like contact centers must be designed to handle long, complex user stories.
#4 about 3 minutes
Demo of a voice-enabled self-service kiosk
A demonstration of a web-based airport kiosk that answers user questions spoken in different languages using a custom voice AI.
#5 about 1 minute
The core challenge of integrating voice technologies
The main difficulty in building a voice AI is not calling the individual APIs but integrating the entire pipeline, from the frontend audio stream to backend processing.
#6 about 3 minutes
Capturing cross-browser microphone audio with RecordRTC
The RecordRTC library is used to abstract away browser inconsistencies and reliably capture microphone audio streams for processing.
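As a rough illustration of the capture step, the sketch below builds an options object of the kind RecordRTC accepts. The option names follow RecordRTC's documented configuration; the values (mono, 16 kHz, 4-second slices) and the helper name `buildRecorderOptions` are assumptions chosen for speech input, not settings confirmed by the talk.

```typescript
// Sketch only: the values below are assumptions for speech recognition input.
interface RecorderOptions {
  type: "audio";
  mimeType: string;
  desiredSampRate: number;      // target sample rate after resampling, in Hz
  numberOfAudioChannels: 1 | 2; // mono is usually enough for speech
  timeSlice: number;            // emit a chunk every N milliseconds
  ondataavailable: (chunk: unknown) => void;
}

// buildRecorderOptions is a hypothetical helper, not part of RecordRTC.
function buildRecorderOptions(onChunk: (chunk: unknown) => void): RecorderOptions {
  return {
    type: "audio",
    mimeType: "audio/wav",
    desiredSampRate: 16000,
    numberOfAudioChannels: 1,
    timeSlice: 4000,
    ondataavailable: onChunk,
  };
}

// In the browser, the options would be handed to the library roughly like this:
//   const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
//   const recorder = RecordRTC(stream, {
//     ...buildRecorderOptions(sendChunk), // sendChunk: your own upload callback
//     recorderType: RecordRTC.StereoAudioRecorder,
//   });
//   recorder.startRecording();
```

Keeping the options in a plain function makes the audio format easy to check and to keep in sync with whatever the backend recognizer expects.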
#7 about 2 minutes
Streaming audio to the backend with Socket.IO
Socket.IO and the socket.io-stream module enable real-time, bidirectional streaming of binary audio data from the browser to a Node.js backend.
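To make the idea of streaming binary audio concrete, here is a minimal sketch of splitting a buffer into chunks on the sending side and joining them again on the receiving side. It is not the wire format used by Socket.IO or socket.io-stream; the function names and the chunk size are assumptions for illustration only.

```typescript
// Split a binary audio buffer into fixed-size chunks, as a sender might before
// emitting each one over a socket. chunkSize is an assumed value, not a
// Socket.IO or socket.io-stream default.
function splitIntoChunks(audio: Uint8Array, chunkSize: number): Uint8Array[] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  const chunks: Uint8Array[] = [];
  for (let offset = 0; offset < audio.length; offset += chunkSize) {
    chunks.push(audio.slice(offset, offset + chunkSize));
  }
  return chunks;
}

// Reassemble chunks in arrival order, as a receiver might before passing the
// audio on to a speech recognizer.
function joinChunks(chunks: Uint8Array[]): Uint8Array {
  let total = 0;
  for (const c of chunks) total += c.length;
  const out = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) {
    out.set(c, offset);
    offset += c.length;
  }
  return out;
}
```

In a real application the streaming library takes care of this framing; the sketch only shows why the backend can start processing before the user has finished speaking.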
#8 about 3 minutes
Transcribing audio with the Speech-to-Text API
Google's Speech-to-Text API converts the incoming audio stream into text using a streaming recognition call that handles data as it arrives.
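A streaming recognition call takes a configuration describing the incoming audio. The sketch below builds such a request; the field names follow the Speech-to-Text streaming request shape, while the helper name and the chosen values (16-bit PCM at 16 kHz, interim results on) are assumptions rather than the talk's exact settings.

```typescript
interface StreamingRequest {
  config: {
    encoding: "LINEAR16";    // uncompressed 16-bit PCM
    sampleRateHertz: number; // must match the rate of the captured audio
    languageCode: string;    // BCP-47 tag such as "en-US"
  };
  interimResults: boolean;   // also return partial transcripts while audio arrives
}

// buildStreamingRequest is a hypothetical helper for this sketch.
function buildStreamingRequest(languageCode: string): StreamingRequest {
  return {
    config: {
      encoding: "LINEAR16",
      sampleRateHertz: 16000,
      languageCode: languageCode,
    },
    interimResults: true,
  };
}

// With the @google-cloud/speech Node.js client, usage would look roughly like:
//   const client = new speech.SpeechClient();
//   const recognizeStream = client
//     .streamingRecognize(buildStreamingRequest("en-US"))
//     .on("data", (data) => console.log(data.results[0].alternatives[0].transcript));
//   audioStream.pipe(recognizeStream); // audioStream: the incoming socket stream
```

The sample rate and encoding here have to agree with what the browser recorder produces, otherwise recognition quality drops or the call fails.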
#9 about 4 minutes
Understanding user intent with Dialogflow
Dialogflow uses natural language understanding to match transcribed user text to predefined intents, entities, and knowledge bases to determine the user's goal.
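Intent matching starts with a detect-intent request that carries the transcribed text. The sketch below assumes Dialogflow ES (the v2 API); the helper names and the project and session identifiers in the usage are hypothetical.

```typescript
interface DetectIntentRequest {
  session: string;
  queryInput: {
    text: { text: string; languageCode: string };
  };
}

// Dialogflow ES session names have the form projects/<project>/agent/sessions/<session>.
function buildSessionPath(projectId: string, sessionId: string): string {
  return "projects/" + projectId + "/agent/sessions/" + sessionId;
}

// buildDetectIntentRequest is a hypothetical helper for this sketch.
function buildDetectIntentRequest(
  projectId: string,
  sessionId: string,
  transcript: string,
  languageCode: string
): DetectIntentRequest {
  return {
    session: buildSessionPath(projectId, sessionId),
    queryInput: {
      text: { text: transcript, languageCode: languageCode },
    },
  };
}

// With the @google-cloud/dialogflow Node.js client, usage would look roughly like:
//   const sessionsClient = new dialogflow.SessionsClient();
//   const [response] = await sessionsClient.detectIntent(request);
//   response.queryResult.intent.displayName; // matched intent
//   response.queryResult.fulfillmentText;    // text answer for the user
```

Reusing the same session identifier across turns lets Dialogflow keep conversational context between utterances.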
#10 about 4 minutes
Adding multi-language support with the Translate API
The Translate API enables multi-language support by translating non-English input to English for Dialogflow processing and then translating the response back into the user's language.
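The translate, match, translate-back flow can be sketched as a small pipeline. The `translate` and `detectIntent` parameters below are injected stand-ins for the real Translate and Dialogflow calls; their signatures and the function name `answerInUserLanguage` are assumptions made for this illustration.

```typescript
// Stand-in signatures (assumptions): real implementations would wrap the
// Cloud Translation and Dialogflow clients.
type Translate = (text: string, targetLanguage: string) => Promise<string>;
type DetectIntent = (englishText: string) => Promise<string>;

async function answerInUserLanguage(
  userText: string,
  userLanguage: string,
  translate: Translate,
  detectIntent: DetectIntent
): Promise<string> {
  // Skip both translation steps when the user already speaks English.
  const isEnglish = userLanguage.toLowerCase().slice(0, 2) === "en";
  const query = isEnglish ? userText : await translate(userText, "en");
  const reply = await detectIntent(query);
  return isEnglish ? reply : translate(reply, userLanguage);
}
```

With this shape the Dialogflow agent only needs to be trained in English, at the cost of two extra API calls per non-English turn.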
#11 about 3 minutes
Generating audio responses with Text-to-Speech
The Text-to-Speech API synthesizes a natural-sounding voice from the text response, which is then sent back to the browser as an audio buffer to be played.
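A synthesis request names the text, the voice, and the output audio format. The sketch below follows the Text-to-Speech request shape; the helper name, the neutral voice, the MP3 encoding, and the socket event name in the comment are assumptions, not the talk's confirmed settings.

```typescript
interface SynthesisRequest {
  input: { text: string };
  voice: { languageCode: string; ssmlGender: "NEUTRAL" | "FEMALE" | "MALE" };
  audioConfig: { audioEncoding: "MP3" | "LINEAR16" | "OGG_OPUS" };
}

// buildSynthesisRequest is a hypothetical helper for this sketch.
function buildSynthesisRequest(text: string, languageCode: string): SynthesisRequest {
  return {
    input: { text: text },
    voice: { languageCode: languageCode, ssmlGender: "NEUTRAL" },
    audioConfig: { audioEncoding: "MP3" },
  };
}

// With the @google-cloud/text-to-speech Node.js client, usage would look roughly like:
//   const client = new textToSpeech.TextToSpeechClient();
//   const [response] = await client.synthesizeSpeech(request);
//   socket.emit("audio", response.audioContent); // "audio" is an assumed event name
```

On the browser side the received bytes can be decoded and played through the Web Audio API or an audio element.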
#12 about 1 minute
Deployment considerations and open source code
Browsers only grant microphone access to pages served over HTTPS, so a deployed voice application needs HTTPS; services like App Engine Flex make this easy to configure. The full project code is available on GitHub.
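The HTTPS requirement comes from browsers exposing the microphone only in secure contexts. The sketch below is a simplified approximation of that rule for plain origin strings; the function name is hypothetical, and IPv6 and other special cases are deliberately not handled.

```typescript
// Simplified approximation of the browser's secure-context rule (assumption:
// only https origins and local development hosts are treated as allowed).
// In a real page, window.isSecureContext gives the authoritative answer.
function canRequestMicrophone(origin: string): boolean {
  const match = /^(https?):\/\/([^/:]+)/i.exec(origin);
  if (match === null) return false;
  const scheme = match[1].toLowerCase();
  const host = match[2].toLowerCase();
  if (scheme === "https") return true;
  // Browsers treat localhost as trustworthy even over plain http.
  return host === "localhost" || host === "127.0.0.1";
}
```

This is why a kiosk that works on a developer's machine over `http://localhost` can fail to get microphone access once it is deployed without a certificate.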
Related Videos
Creating bots with Dialogflow CX
Xavier Portilla Edo
Minimal infrastructure for Real‑Time Phone Agents: transcripts in, responses out
Chris Heilmann, Daniel Cranney & Marius Obert, Staff Developer Evangelist at Twilio
WeAreDevelopers LIVE – Real-Time Phone Agents, Unsafe VPNs & More
Chris Heilmann, Daniel Cranney & Marius Obert
OpenAI for FinTech: Building a Stock Market Advisor Chatbot
Akmal Chaudhri
From Syntax to Singularity: AI’s Impact on Developer Roles
Anna Fritsch-Weninger
Inside the AI Revolution: How Microsoft is Empowering the World to Achieve More
Simi Olabisi
From ML to LLM: On-device AI in the Browser
Nico Martin
Integrate your Cognitive Assistant with 3rd-party DBs and software
Felix Augenstein