Chris Heilmann, Daniel Cranney, Marius Obert & Staff Developer Evangelist at Twilio

Minimal infrastructure for Real‑Time Phone Agents: transcripts in, responses out

Forget complex audio pipelines. Build a real-time AI phone agent with a simple text-in, text-out WebSocket and your favorite LLM.

#1 about 4 minutes

Why voice is a powerful and natural AI interface

Voice interaction is significantly faster for input than typing and allows for hands-free operation, making it a natural fit for many AI use cases despite the challenges of audio parsing.

#2 about 1 minute

The complexity of building traditional voice agents

Building a voice agent the traditional way requires managing separate services for speech recognition, text-to-speech, and interruption detection, which introduces significant latency and complexity.

#3 about 4 minutes

Simplifying voice agent architecture with ConversationRelay

Twilio's ConversationRelay abstracts away the complexities of voice processing, allowing developers to receive text transcripts via a WebSocket and focus solely on their application logic.
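The text-in, text-out exchange described above can be sketched as a single pure function. This is a minimal sketch, not the speaker's code: the message shapes (`setup`, `prompt` with `voicePrompt`, `interrupt`, and an outgoing `text` message with `token` and `last`) are assumptions based on ConversationRelay's JSON protocol, and `handleFrame` is a hypothetical helper name.

```typescript
// Assumed subset of the JSON messages ConversationRelay sends over the WebSocket.
type IncomingMessage =
  | { type: "setup"; callSid: string }
  | { type: "prompt"; voicePrompt: string }
  | { type: "interrupt"; utteranceUntilInterrupt: string };

// Assumed shape of the text message the server sends back to be spoken.
interface OutgoingText {
  type: "text";
  token: string;
  last: boolean;
}

// Hypothetical helper: turns one raw WebSocket frame into zero or one reply.
function handleFrame(raw: string): OutgoingText | null {
  const message = JSON.parse(raw) as IncomingMessage;
  if (message.type === "prompt") {
    // The caller's speech arrives already transcribed as plain text.
    return { type: "text", token: `You said: ${message.voicePrompt}`, last: true };
  }
  // Setup and interrupt messages need no spoken reply in this sketch.
  return null;
}
```

In a real server the returned object would be serialized with `JSON.stringify` and written to the socket; the application logic never touches audio.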

#4 about 6 minutes

Live coding a Deno server for a phone agent

A basic Deno server is set up to do two things: answer the initial HTTP request with TwiML instructions, and upgrade the follow-up connection to a WebSocket for real-time communication.
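The two responsibilities of that server can be sketched as two small functions. This is an illustration under stated assumptions, not the code from the talk: `buildTwiml`, `chooseRoute`, and the `/ws` path are hypothetical, and the `<ConversationRelay>` attributes `url` and `welcomeGreeting` are assumed from Twilio's TwiML documentation. In Deno, the `"websocket"` branch would typically call `Deno.upgradeWebSocket(req)` and return its `response`.

```typescript
// Escapes characters that are unsafe inside an XML attribute value.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Builds the TwiML that tells Twilio to open a WebSocket back to this server.
// The "/ws" path is a hypothetical choice for this sketch.
function buildTwiml(host: string, greeting: string): string {
  const wsUrl = `wss://${host}/ws`;
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    "  <Connect>",
    `    <ConversationRelay url="${escapeXml(wsUrl)}" welcomeGreeting="${escapeXml(greeting)}" />`,
    "  </Connect>",
    "</Response>",
  ].join("\n");
}

// Decides how to answer a request based on its Upgrade header.
function chooseRoute(upgradeHeader: string | null): "websocket" | "twiml" {
  return (upgradeHeader ?? "").toLowerCase() === "websocket" ? "websocket" : "twiml";
}
```

Keeping both functions free of server APIs makes them easy to test without a running server or a phone call.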

#5 about 3 minutes

Configuring a Twilio number and testing the connection

A new phone number is purchased and configured in the Twilio console to point to the server's webhook, followed by a live call to test the transcription and hardcoded response.

#6 about 5 minutes

Integrating OpenAI for streaming dynamic responses

The OpenAI API is integrated to generate dynamic responses, using streaming to send text chunks back as they are generated to minimize perceived latency for the caller.
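The streaming step can be sketched as a small forwarder that relays each text chunk the moment it arrives. This is a sketch under assumptions: `createTokenForwarder` is a hypothetical helper, the `text`/`token`/`last` message shape is assumed from ConversationRelay's protocol, and the wiring to the OpenAI SDK's stream events is left out. `onDelta` would be called for every text delta and `onDone` once the stream completes.

```typescript
interface TokenForwarder {
  onDelta(text: string): void;
  onDone(): void;
}

// Hypothetical helper: relays model output to the caller chunk by chunk,
// so speech can start before the full answer has been generated.
function createTokenForwarder(send: (frame: string) => void): TokenForwarder {
  return {
    onDelta(text: string): void {
      // Skip empty deltas; they carry nothing to speak.
      if (text.length === 0) return;
      send(JSON.stringify({ type: "text", token: text, last: false }));
    },
    onDone(): void {
      // Assumption: an empty final token with last=true marks the end of the reply.
      send(JSON.stringify({ type: "text", token: "", last: true }));
    },
  };
}
```

Sending chunks as they arrive cuts the wait before the first spoken word, which is what the caller perceives as latency.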

#7 about 2 minutes

Adding conversational memory for context-aware replies

A simple map is used to store the last message ID for each WebSocket connection, enabling the OpenAI API to maintain conversational history for follow-up questions.
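The per-connection memory can be sketched as a thin wrapper around a `Map`. This is a hedged illustration, not the talk's code: `ConversationMemory` and `ModelRequest` are hypothetical names, the stored ID is called a response ID here, and `previous_response_id` is assumed to be the OpenAI Responses API parameter used to chain turns.

```typescript
// Hypothetical request shape; only the fields relevant to history are shown.
interface ModelRequest {
  input: string;
  previous_response_id?: string;
}

// Tracks the most recent response ID per connection (any key type works,
// for example a WebSocket object or a call identifier).
class ConversationMemory<Key> {
  private lastResponseIds = new Map<Key, string>();

  // Builds the next request, linking it to the previous turn if one exists.
  buildRequest(connection: Key, userText: string): ModelRequest {
    const previous = this.lastResponseIds.get(connection);
    return previous === undefined
      ? { input: userText }
      : { input: userText, previous_response_id: previous };
  }

  // Call after each completed response so the next turn can reference it.
  remember(connection: Key, responseId: string): void {
    this.lastResponseIds.set(connection, responseId);
  }

  // Call when the socket closes so finished calls do not leak entries.
  forget(connection: Key): void {
    this.lastResponseIds.delete(connection);
  }
}
```

Storing only the last ID keeps the server nearly stateless; the conversation history itself stays on the API side under this assumption.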

#8 about 2 minutes

Final demo with AI, history, and interruption

The final demonstration showcases the fully functional AI phone agent handling a multi-turn conversation, remembering context, and allowing the user to interrupt its response.
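One way the server side of an interruption could look is sketched below. This is an assumption-heavy sketch rather than what the demo necessarily did: `ReplySession` is a hypothetical class, and it assumes the relay sends a JSON message with `type: "interrupt"` when the caller talks over the agent. The idea is simply to stop forwarding tokens from a reply the caller no longer wants to hear.

```typescript
// Hypothetical per-reply session that stops sending once the caller interrupts.
class ReplySession {
  private cancelled = false;
  private send: (frame: string) => void;

  constructor(send: (frame: string) => void) {
    this.send = send;
  }

  // Forwards a token unless the reply was interrupted; returns whether it was sent.
  pushToken(token: string): boolean {
    if (this.cancelled) return false;
    this.send(JSON.stringify({ type: "text", token, last: false }));
    return true;
  }

  // Inspects an incoming frame and cancels the reply on an interrupt message.
  handleIncoming(raw: string): void {
    const message = JSON.parse(raw) as { type: string };
    if (message.type === "interrupt") this.cancelled = true;
  }
}
```

A caller of `pushToken` can use the `false` return value as a signal to stop reading from the model stream as well.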
