Reducing LLM Calls with Vector Search Patterns - Raphael De Lio (Redis)
Large context windows aren't the answer. Learn three vector search patterns to slash your LLM costs and latency.
#1 · about 3 minutes
The hidden costs of large LLM context windows
Large context windows in models like GPT-5 seem to eliminate the need for RAG, but sending a massive context with every request drives up token costs and doesn't scale.
#2 · about 3 minutes
A brief introduction to vectors and vector search
Text is converted into numerical vector embeddings that capture its semantic meaning, allowing computers to efficiently calculate the similarity between different phrases or documents.
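To make the intuition concrete, here's a minimal sketch, assuming the open-source sentence-transformers library and its all-MiniLM-L6-v2 model (the talk doesn't prescribe a specific embedding model):

```python
# pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means same direction, near 0.0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a, b = model.encode(["How do I reset my password?",
                     "I forgot my login credentials"])
print(cosine_similarity(a, b))  # high score despite sharing no keywords
```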
#3 · about 9 minutes
How to classify text using a vector database
Instead of using a costly LLM for every classification task, you can use a vector database to match new text against pre-embedded reference examples for each label.
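A hedged sketch of the pattern: embed a few labeled reference examples once, then classify new text by nearest-neighbor lookup instead of an LLM call. The labels and examples are illustrative, not from the talk, and a real deployment would store the vectors in a vector database rather than in memory:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Pre-embed reference examples once; each carries a label.
references = [
    ("billing", "I was charged twice this month"),
    ("billing", "Why did my invoice go up?"),
    ("support", "The app crashes when I open it"),
    ("support", "I can't log into my account"),
]
labels = [label for label, _ in references]
ref_vecs = model.encode([text for _, text in references],
                        normalize_embeddings=True)

def classify(text: str) -> str:
    """Return the label of the most similar reference example."""
    vec = model.encode([text], normalize_embeddings=True)[0]
    scores = ref_vecs @ vec  # cosine similarity (vectors are normalized)
    return labels[int(np.argmax(scores))]

print(classify("my payment went through two times"))  # -> "billing"
```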
#4 · about 5 minutes
Using semantic routing for efficient tool calling
By matching user prompts against pre-defined reference phrases for each tool, you can directly trigger the correct function without an initial, expensive LLM call.
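The same nearest-neighbor trick routes prompts to tools. In this sketch the two tools and their reference phrases are illustrative; Redis's RedisVL library ships a semantic router built on this pattern, but the core idea fits in plain numpy:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Each tool is described by a few reference phrases, embedded once.
routes = {
    "get_weather":  ["what's the weather like", "will it rain tomorrow"],
    "book_meeting": ["schedule a call", "set up a meeting with"],
}
route_names, phrases = [], []
for name, examples in routes.items():
    route_names += [name] * len(examples)
    phrases += examples
route_vecs = model.encode(phrases, normalize_embeddings=True)

def route(prompt: str) -> str:
    """Pick the tool whose reference phrases best match the prompt."""
    vec = model.encode([prompt], normalize_embeddings=True)[0]
    return route_names[int(np.argmax(route_vecs @ vec))]

print(route("can you check if I need an umbrella today?"))  # -> "get_weather"
```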
#5 · about 5 minutes
Reducing latency and cost with semantic caching
Semantic caching stores LLM responses and serves them for new, semantically similar prompts, which avoids re-computation and significantly reduces both cost and latency.
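The core mechanics fit in a few lines. This toy in-memory version assumes sentence-transformers and a hand-picked similarity threshold; a production cache (such as RedisVL's SemanticCache) would persist the vectors in Redis behind an ANN index:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

class ToySemanticCache:
    """In-memory sketch: serve a stored response when a new prompt
    is semantically close enough to one seen before."""
    def __init__(self, threshold: float = 0.9):  # illustrative cutoff
        self.threshold = threshold
        self.vectors, self.responses = [], []

    def get(self, prompt: str) -> str | None:
        if not self.vectors:
            return None
        vec = model.encode([prompt], normalize_embeddings=True)[0]
        scores = np.stack(self.vectors) @ vec
        best = int(np.argmax(scores))
        return self.responses[best] if scores[best] >= self.threshold else None

    def put(self, prompt: str, response: str) -> None:
        self.vectors.append(model.encode([prompt], normalize_embeddings=True)[0])
        self.responses.append(response)

cache = ToySemanticCache()
cache.put("What is Redis?", "Redis is an in-memory data store...")
print(cache.get("Can you explain what Redis is?"))  # cache hit: no LLM call
```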
#6 · about 7 minutes
Strategies for optimizing vector search accuracy
Improve the accuracy of vector search patterns through techniques like self-improvement, a hybrid approach that falls back to an LLM, and chunking complex prompts into smaller clauses.
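Two of those strategies compose naturally: fall back to the LLM only when the best vector match is below a confidence threshold, then feed the LLM's answer back into the index so the next similar prompt takes the cheap path. In this sketch, call_llm is a hypothetical stand-in for your actual model call and the threshold is illustrative:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
THRESHOLD = 0.80  # illustrative; tune against a labeled evaluation set

examples = [("billing", "I was charged twice"), ("support", "the app crashes")]
ref_labels = [label for label, _ in examples]
ref_vecs = model.encode([text for _, text in examples],
                        normalize_embeddings=True)

def call_llm(text: str) -> str:
    """Hypothetical placeholder: wire this to your actual LLM classifier."""
    raise NotImplementedError

def classify(text: str) -> str:
    global ref_vecs
    vec = model.encode([text], normalize_embeddings=True)[0]
    scores = ref_vecs @ vec
    best = int(np.argmax(scores))
    if scores[best] >= THRESHOLD:
        return ref_labels[best]            # confident vector match: no LLM call
    label = call_llm(text)                 # low confidence: pay for one LLM call
    ref_vecs = np.vstack([ref_vecs, vec])  # self-improvement: index the new case
    ref_labels.append(label)               # so future similar prompts stay cheap
    return label
```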
#7 · about 3 minutes
Addressing advanced challenges in semantic caching
Mitigate common caching pitfalls, such as misinterpreting negated prompts, by using specialized embedding models and by combining semantic routing with caching so that certain types of queries are never cached.
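One way to implement the "don't cache these" idea is to reuse the routing trick: embed reference phrases for volatile or user-specific queries and bypass the cache when a prompt matches them. The phrases and threshold below are illustrative:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Reference phrases for queries whose answers change over time or per user;
# these should skip the semantic cache entirely.
no_cache_refs = model.encode(
    ["what time is it", "what's the weather right now", "show my latest orders"],
    normalize_embeddings=True,
)
NO_CACHE_THRESHOLD = 0.75  # illustrative; tune on real traffic

def is_cacheable(prompt: str) -> bool:
    """Cache only prompts that don't resemble any volatile reference phrase."""
    vec = model.encode([prompt], normalize_embeddings=True)[0]
    return float(np.max(no_cache_refs @ vec)) < NO_CACHE_THRESHOLD

print(is_cacheable("what's the capital of France?"))    # True: safe to cache
print(is_cacheable("what's the weather in Paris now"))  # False: skip the cache
```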