Semantic Caching: The Secret to 50ms LLM Response Times
Deep dive into how semantic caching with embedding similarity can cut latency by 95% and eliminate API costs entirely for repeated queries.
B2ALABS Engineering
The Performance Problem with LLM APIs
Large Language Model (LLM) APIs like OpenAI's GPT-5, Anthropic's Claude, and Google's Gemini are incredibly powerful,
but they come with two significant challenges: high latency (typically 1-3 seconds per request) and
high cost ($0.03+ per 1K tokens). For applications serving thousands of users, these constraints
can make or break the user experience and business viability.
What is Semantic Caching?
Semantic caching is an intelligent caching strategy that goes beyond traditional exact-match caching.
Instead of only returning cached results for identical queries, semantic caching uses embedding
similarity to identify when two queries are semantically similar—even if the exact wording differs.
Traditional Caching vs. Semantic Caching
Traditional Cache:
"What's the weather in SF?" - Cache Miss
"What's the weather in San Francisco?" - Cache Miss
(Two different queries, so no cache hit despite the same intent)
Semantic Cache:
"What's the weather in SF?" - Cache Miss (first time)
"What's the weather in San Francisco?" - Cache Hit! (95% similarity)
B2ALABS® implements semantic caching using a four-step process:
Step 1: Generate Embeddings
When a new prompt arrives, we generate a 384-dimensional embedding vector using a fast embedding model
like all-MiniLM-L6-v2. This vector represents the semantic meaning of the prompt.
Prompt: "Explain machine learning to a 5-year-old"
Embedding: [0.23, -0.15, 0.67, ..., 0.89] (384 dimensions)
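As a rough sketch of this step (assuming the sentence-transformers package; B2ALABS® may use a different embedding runtime internally), generating such a vector looks like this:

# step 1: generate a 384-dimensional embedding for the prompt
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode("Explain machine learning to a 5-year-old")
print(embedding.shape)  # (384,) -- one float per dimension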
Step 2: Search Vector Database
We store all prompt embeddings in a vector database (Redis with vector similarity search, or Pinecone).
For the new prompt, we search for the most similar cached embeddings using cosine similarity.
Step 3: Return the Cached Response on a Hit
If the similarity score exceeds our threshold (typically 95%), we consider it a cache hit and return
the cached LLM response instantly, with no API call needed.
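A minimal in-process sketch of Steps 2 and 3 (plain NumPy cosine similarity over an in-memory list; a production setup would query Redis vector search or Pinecone instead, and 0.95 is the typical threshold mentioned above):

# steps 2-3: find the nearest cached embedding and check the threshold
import numpy as np

SIMILARITY_THRESHOLD = 0.95  # "typically 95%"

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def lookup(prompt_embedding, cache_entries):
    """cache_entries: list of (embedding, cached_response) tuples."""
    best_score, best_response = 0.0, None
    for cached_embedding, cached_response in cache_entries:
        score = cosine_similarity(prompt_embedding, cached_embedding)
        if score > best_score:
            best_score, best_response = score, cached_response
    if best_score >= SIMILARITY_THRESHOLD:
        return best_response, best_score   # cache hit: no API call needed
    return None, best_score                # cache miss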
Step 4: Cache New Responses
If there's no cache hit (similarity < 95%), we call the LLM API, get the response, and cache both the
prompt embedding and the response for future queries.
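Continuing the same in-memory sketch, Step 4 might look like this (in production the write would go to the Redis or Pinecone index; call_llm_api is a placeholder for the real provider call, not a B2ALABS® API):

# step 4: on a miss, call the LLM and cache both the embedding and the response
def get_completion(prompt, cache_entries, model, call_llm_api):
    prompt_embedding = model.encode(prompt)
    cached_response, _score = lookup(prompt_embedding, cache_entries)
    if cached_response is not None:
        return cached_response               # served from cache in ~50ms
    response = call_llm_api(prompt)          # cache miss: pay the 1-3s API call
    cache_entries.append((prompt_embedding, response))
    return response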
Real-World Performance Gains
Here's what semantic caching achieves in production:
Latency Reduction: 95%
Without Caching:
LLM API call: 1,200ms average response time
With Semantic Caching:
Cache hit: 50ms average response time
(95% faster!)
Cost Savings: 100% on Cache Hits
A cache hit costs $0, because no LLM API call means no usage charges. With an average cache hit rate
of 34%, customers save over $10,000 per month on API costs alone.
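To see how a figure like that can arise, assume purely for illustration one million cacheable requests per month at about $0.03 each (roughly 1K tokens per request): a 34% hit rate then avoids about 340,000 API calls, or roughly $10,200 in monthly spend.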
Cache Hit Rate Analysis
Customer Support Chatbot:
- Cache Hit Rate: 47%
- Reason: Common questions asked repeatedly
- Annual Savings: $28,000
Content Generation API:
- Cache Hit Rate: 12%
- Reason: Highly unique prompts
- Annual Savings: $7,200
Documentation Q&A:
- Cache Hit Rate: 64%
- Reason: Limited question set
- Annual Savings: $45,000
Implementation with B2ALABS®
Enabling semantic caching with the B2ALABS® AI Gateway takes just a few configuration changes; a working docker-compose example appears in the Getting Started section below.
The real magic happens when semantic caching is combined with B2ALABS®' multi-provider routing:
1. Check the semantic cache first (about 50ms on a hit)
2. On a cache miss, route to the cheapest healthy provider
3. Cache the response together with its embedding
4. Serve future similar queries instantly from the cache
This two-layer optimization delivers 70-95% cost savings and 10-20x faster
response times for typical workloads.
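A hedged sketch of that combined flow (the embed helper, cache object, and provider interface here are illustrative, not the actual B2ALABS® gateway API):

# two-layer flow: semantic cache first, then cheapest-provider routing
def handle_request(prompt, cache, providers, embed):
    embedding = embed(prompt)                          # step 1: embed the prompt
    hit = cache.lookup(embedding, threshold=0.95)      # layer 1: semantic cache
    if hit is not None:
        return hit                                     # ~50ms, $0

    provider = min(                                    # layer 2: route to the
        (p for p in providers if p.is_healthy()),      # cheapest healthy provider
        key=lambda p: p.cost_per_1k_tokens,
    )
    response = provider.complete(prompt)
    cache.store(embedding, response)                   # future similar queries hit
    return response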
Case Study: E-Commerce Product Recommendations
An e-commerce platform implemented B2ALABS® semantic caching for their AI-powered product recommendation engine:
Before B2ALABS®
Average response time: 1,850ms
Monthly API cost: $18,500
Cache hit rate: 0% (no caching)
After B2ALABS®
Average response time: 280ms (85% faster)
Monthly API cost: $4,200 (77% savings)
Cache hit rate: 52%
Customer satisfaction: +23% (faster responses)
Getting Started
Enable semantic caching in your B2ALABS® deployment:
# docker-compose.yml
services:
  redis:
    image: redis/redis-stack:latest
    volumes:
      - redis-data:/data

  gateway:
    image: b2alabs/gateway:latest
    environment:
      - SEMANTIC_CACHE_ENABLED=true
      - REDIS_URL=redis://redis:6379
    depends_on:
      - redis

volumes:
  redis-data:
# Start the stack
docker-compose up -d
# Test semantic caching
curl -X POST http://localhost:8080/api/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-5-mini",
"messages": [{"role": "user", "content": "Explain AI"}]
}'
# Try a similar prompt
curl -X POST http://localhost:8080/api/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-5-mini",
"messages": [{"role": "user", "content": "What is artificial intelligence?"}]
}'
# Check cache hit in response headers:
# X-Cache: HIT
# X-Cache-Similarity: 0.96
Conclusion
Semantic caching transforms LLM applications from slow and expensive to fast and cost-effective.
By understanding the semantic meaning of prompts rather than just exact text matches, B2ALABS®
achieves cache hit rates 3-5x higher than traditional caching—saving money and delighting users
with sub-100ms response times.