Lee Boonstra

Jun 17, 2020 • WeAreDevelopers LIVE

Raise your voice!

The hardest part of building a voice AI isn't the AI. It's making all the tools work together in real-time. Here's the complete blueprint.

#1about 1 minute

Building a custom voice AI with WebRTC and Google APIs

An overview of the architecture for streaming voice from a browser to a backend for processing with conversational AI.

#2about 4 minutes

Comparing custom voice AI to public assistants

A custom voice AI provides more control over technical requirements and terms of service compared to public platforms like Google Assistant or Alexa.

#3about 1 minute

Handling short versus long user utterances

Public assistants are optimized for short commands, whereas custom AI for use cases like contact centers must be designed to handle long, complex user stories.

#4about 3 minutes

Demo of a voice-enabled self-service kiosk

A demonstration of a web-based airport kiosk that answers user questions spoken in different languages using a custom voice AI.

#5about 1 minute

The core challenge of integrating voice technologies

The main difficulty in building a voice AI is not using individual APIs, but integrating the entire pipeline from frontend audio stream to backend processing.

#6about 3 minutes

Capturing cross-browser microphone audio with RecordRTC

The RecordRTC library is used to abstract away browser inconsistencies and reliably capture microphone audio streams for processing.

#7about 2 minutes

Streaming audio to the backend with Socket.IO

Socket.IO and the socket.io-stream module enable real-time, bidirectional streaming of binary audio data from the browser to a Node.js backend.

#8about 3 minutes

Transcribing audio with the Speech-to-Text API

Google's Speech-to-Text API converts the incoming audio stream into text using a streaming recognition call that handles data as it arrives.

#9about 4 minutes

Understanding user intent with Dialogflow

Dialogflow uses natural language understanding to match transcribed user text to predefined intents, entities, and knowledge bases to determine the user's goal.

#10about 4 minutes

Adding multi-language support with the Translate API

The Translate API enables multi-language support by translating foreign language input to English for Dialogflow processing and then translating the response back.

#11about 3 minutes

Generating audio responses with Text-to-Speech

The Text-to-Speech API synthesizes a natural-sounding voice from the text response, which is then sent back to the browser as an audio buffer to be played.

#12about 1 minute

Deployment considerations and open source code

Deploying a voice application requires HTTPS for microphone access, which can be easily configured using services like App Engine Flex, and the full project code is available on GitHub.