Semantic Caching: The Secret to 50ms LLM Response Times

Deep dive into how semantic caching with embedding similarity can reduce latency by 95% and costs by 100% for repeated queries.

The Performance Problem with LLM APIs

Large Language Model (LLM) APIs like OpenAI's GPT-5, Anthropic's Claude, and Google's Gemini are incredibly powerful, but they come with two significant challenges: high latency (typically 1-3 seconds per request) and high cost ($0.03+ per 1K tokens). For applications serving thousands of users, these constraints can make or break the user experience and business viability.

What is Semantic Caching?

Semantic caching is an intelligent caching strategy that goes beyond traditional exact-match caching. Instead of only returning cached results for identical queries, semantic caching uses embedding similarity to identify when two queries are semantically similar—even if the exact wording differs.

Traditional Caching vs. Semantic Caching

Traditional Cache:
"What's the weather in SF?" → Cache Miss
"What's the weather in San Francisco?" → Cache Miss
(Two different queries, no cache hit despite the same intent)

Semantic Cache:
"What's the weather in SF?" → Cache Miss (first time)
"What's the weather in San Francisco?" → Cache Hit! (95% similarity)
(Recognizes the semantic similarity and returns the cached response)

How Semantic Caching Works

B2ALABS® implements semantic caching using a four-step process:

Step 1: Generate Embeddings

When a new prompt arrives, we generate a 384-dimensional embedding vector using a fast embedding model like all-MiniLM-L6-v2. This vector represents the semantic meaning of the prompt.

Prompt: "Explain machine learning to a 5-year-old"
Embedding: [0.23, -0.15, 0.67, ..., 0.89] (384 dimensions)

Step 2: Search Vector Database

We store all prompt embeddings in a vector database (Redis with vector similarity search or Pinecone). For the new prompt, we search for the most similar cached embeddings using cosine similarity.

Query Embedding:   [0.23, -0.15, 0.67, ...]
Cached Embedding:  [0.25, -0.16, 0.65, ...]
Cosine Similarity: 0.97 (97% similar!)
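
As a rough illustration, cosine similarity takes only a few lines of TypeScript. This is a sketch for clarity, not the gateway's internal implementation; because the embeddings are normalized, the dot product alone would give the same score.

// Cosine similarity between two embedding vectors.
// With normalized embeddings this reduces to a plain dot product.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Example: cosineSimilarity(queryEmbedding, cachedEmbedding) → 0.97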

Step 3: Threshold Check

If the similarity score exceeds our threshold (typically 95%), we consider it a cache hit and return the cached LLM response instantly—no API call needed.

Step 4: Cache New Responses

If there's no cache hit (similarity < 95%), we call the LLM API, get the response, and cache both the prompt embedding and the response for future queries.
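
Put together, Steps 3 and 4 look roughly like the sketch below. It uses an in-memory array instead of Redis or Pinecone so the example stays self-contained, reuses the cosineSimilarity helper from above, and the embed and callLLM parameters are hypothetical stand-ins rather than the actual gateway API.

interface CacheEntry {
  embedding: Float32Array;
  response: string;
}

const SIMILARITY_THRESHOLD = 0.95;
const cache: CacheEntry[] = []; // stand-in for Redis/Pinecone

// embed() and callLLM() are hypothetical helpers passed in for this sketch.
async function cachedCompletion(
  prompt: string,
  embed: (text: string) => Promise<Float32Array>,
  callLLM: (text: string) => Promise<string>
): Promise<string> {
  const queryEmbedding = await embed(prompt);

  // Step 3: find the closest cached entry and check it against the threshold.
  let best: CacheEntry | null = null;
  let bestScore = -1;
  for (const entry of cache) {
    const score = cosineSimilarity(queryEmbedding, entry.embedding);
    if (score > bestScore) {
      bestScore = score;
      best = entry;
    }
  }
  if (best && bestScore >= SIMILARITY_THRESHOLD) {
    return best.response; // cache hit: no LLM API call needed
  }

  // Step 4: cache miss, so call the LLM and store the new entry for next time.
  const response = await callLLM(prompt);
  cache.push({ embedding: queryEmbedding, response });
  return response;
}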

Real-World Performance Gains

Here's what semantic caching achieves in production:

Latency Reduction: 95%

Without Caching:
LLM API call: 1,200ms average response time

With Semantic Caching:
Cache hit: 50ms average response time
(95% faster!)

Cost Savings: 100% on Cache Hits

Cache hits cost $0. No LLM API call means no usage charges. With an average cache hit rate of 34%, customers save over $10,000 per month on API costs alone.
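
To make that arithmetic concrete: with illustrative numbers of one million requests per month at roughly $0.03 per LLM call (assumptions for this example, not customer data), a 34% hit rate avoids about $10,200 in monthly API charges.

// Illustrative numbers only; actual traffic and per-call cost vary by workload.
const monthlyRequests = 1_000_000;
const avgCostPerCall = 0.03; // USD per LLM API call
const cacheHitRate = 0.34;   // 34% of requests served from cache

const monthlySavings = monthlyRequests * cacheHitRate * avgCostPerCall;
console.log(monthlySavings); // 10200, roughly $10,000/month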

Cache Hit Rate Analysis

Customer Support Chatbot:
- Cache Hit Rate: 47%
- Reason: Common questions asked repeatedly
- Annual Savings: $28,000

Content Generation API:
- Cache Hit Rate: 12%
- Reason: Highly unique prompts
- Annual Savings: $7,200

Documentation Q&A:
- Cache Hit Rate: 64%
- Reason: Limited question set
- Annual Savings: $45,000

Implementation with B2ALABS®

Enabling semantic caching with B2ALABS® AI Gateway takes just a few configuration changes:

1. Enable Redis Vector Search

version: '3.8'
services:
  redis:
    image: redis/redis-stack:latest
    ports:
      - "6379:6379"
    environment:
      - REDIS_ARGS=--loadmodule /opt/redis-stack/lib/redisearch.so

2. Configure Caching in Gateway

SEMANTIC_CACHE_ENABLED=true
SEMANTIC_CACHE_SIMILARITY_THRESHOLD=0.95
SEMANTIC_CACHE_TTL=3600
EMBEDDING_MODEL=all-MiniLM-L6-v2

3. Monitor Cache Performance

B2ALABS® includes Grafana dashboards showing:

  • Cache hit rate: Percentage of requests served from cache
  • Average latency by source: Cache vs. API response times
  • Cost savings: Dollars saved from cache hits
  • Similarity score distribution: How close cache hits are to original prompts

Advanced: Fine-Tuning Cache Behavior

Different use cases require different similarity thresholds:

High Similarity Threshold (98%+)

Use Case: Legal documents, medical advice, financial calculations
Trade-off: Lower cache hit rate, but higher accuracy

Medium Similarity Threshold (95%)

Use Case: Customer support, general Q&A, documentation lookup
Trade-off: Balanced hit rate and accuracy (recommended default)

Lower Similarity Threshold (90%)

Use Case: Content recommendations, creative writing prompts
Trade-off: Higher cache hit rate, but potentially less precise responses

Cache Invalidation Strategies

Semantic caches should be invalidated when:

  1. TTL Expires: Automatically expire cached responses after N hours
  2. Content Updates: Clear cache when underlying data changes
  3. Model Upgrades: Invalidate cache when switching LLM versions
  4. Manual Override: Admin controls to clear specific cache entries
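
For the TTL case, each cache entry can simply carry a timestamp that is checked on lookup, as in the minimal sketch below (an in-memory stand-in; with Redis the same effect comes from setting an expiry on the key).

interface ExpiringEntry {
  embedding: Float32Array;
  response: string;
  cachedAt: number; // epoch milliseconds
}

const CACHE_TTL_MS = 3600 * 1000; // mirrors SEMANTIC_CACHE_TTL=3600

// Treat an entry as invalid once its TTL has elapsed.
function isEntryFresh(entry: ExpiringEntry, now = Date.now()): boolean {
  return now - entry.cachedAt < CACHE_TTL_MS;
}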

Technical Deep Dive: Embedding Generation

B2ALABS® uses all-MiniLM-L6-v2 for embedding generation because:

  • Fast: 5ms inference time on CPU
  • Accurate: 82.3% accuracy on semantic similarity benchmarks
  • Compact: 384 dimensions (vs. 1536 for OpenAI embeddings)
  • Cost-effective: Runs locally, no API costs

import { pipeline } from '@xenova/transformers';

const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
const embedding = await embedder(prompt, { pooling: 'mean', normalize: true });

// Result: Float32Array(384) with normalized values
console.log(embedding.data); // [-0.023, 0.456, 0.789, ...]

Combining Semantic Cache with Smart Routing

The real magic happens when semantic caching is combined with B2ALABS® multi-provider routing:

  1. Check semantic cache first (50ms if hit)
  2. If cache miss: Route to cheapest healthy provider
  3. Cache the response with its embedding
  4. Future similar queries: Instant cache hit

This two-layer optimization delivers 70-95% cost savings and 10-20x faster response times for typical workloads.
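
A minimal sketch of that two-layer flow is shown below; checkSemanticCache, pickCheapestHealthyProvider, and storeInCache are placeholder names used for illustration, not the gateway's actual interfaces.

// Illustrative two-layer flow: semantic cache first, then provider routing.
interface Provider {
  complete(prompt: string): Promise<string>;
}

async function handleRequest(
  prompt: string,
  checkSemanticCache: (p: string) => Promise<string | null>,
  pickCheapestHealthyProvider: () => Promise<Provider>,
  storeInCache: (p: string, response: string) => Promise<void>
): Promise<string> {
  // 1. Check the semantic cache first (~50ms on a hit).
  const cached = await checkSemanticCache(prompt);
  if (cached !== null) return cached;

  // 2. Cache miss: route to the cheapest healthy provider.
  const provider = await pickCheapestHealthyProvider();
  const response = await provider.complete(prompt);

  // 3. Cache the response with its embedding for future similar queries.
  await storeInCache(prompt, response);
  return response;
}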

Case Study: E-Commerce Product Recommendations

An e-commerce platform implemented B2ALABS® semantic caching for their AI-powered product recommendation engine:

Before B2ALABS®

  • Average response time: 1,850ms
  • Monthly API cost: $18,500
  • Cache hit rate: 0% (no caching)

After B2ALABS®

  • Average response time: 280ms (85% faster)
  • Monthly API cost: $4,200 (77% savings)
  • Cache hit rate: 52%
  • Customer satisfaction: +23% (faster responses)

Getting Started

Enable semantic caching in your B2ALABS® deployment:

# docker-compose.yml
services:
  redis:
    image: redis/redis-stack:latest
    volumes:
      - redis-data:/data

  gateway:
    image: b2alabs/gateway:latest
    environment:
      - SEMANTIC_CACHE_ENABLED=true
      - REDIS_URL=redis://redis:6379

# Start the stack
docker-compose up -d

# Test semantic caching
curl -X POST http://localhost:8080/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5-mini",
    "messages": [{"role": "user", "content": "Explain AI"}]
  }'

# Try a similar prompt
curl -X POST http://localhost:8080/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5-mini",
    "messages": [{"role": "user", "content": "What is artificial intelligence?"}]
  }'

# Check cache hit in response headers:
# X-Cache: HIT
# X-Cache-Similarity: 0.96

Conclusion

Semantic caching transforms LLM applications from slow and expensive to fast and cost-effective. By understanding the semantic meaning of prompts rather than just exact text matches, B2ALABS® achieves cache hit rates 3-5x higher than traditional caching—saving money and delighting users with sub-100ms response times.

Ready to implement semantic caching? Check out our Getting Started guide or Semantic Caching course.

Trademark Acknowledgments:

OpenAI®, GPT®, GPT-4®, GPT-5®, and ChatGPT® are trademarks of OpenAI, Inc. • Claude® and Anthropic® are trademarks of Anthropic, PBC. • Gemini™, Google™, and PaLM® are trademarks of Google LLC. • Meta®, Llama™, and Meta Llama™ are trademarks of Meta Platforms, Inc. • Mistral AI® is a trademark of Mistral AI. • Cohere® is a trademark of Cohere Inc. • Microsoft®, Azure®, and Azure OpenAI® are trademarks of Microsoft Corporation. • Amazon Web Services®, AWS®, and AWS Bedrock® are trademarks of Amazon.com, Inc. • Together AI™, Replicate®, and Perplexity® are trademarks of their respective owners. • All trademarks and registered trademarks are the property of their respective owners. B2ALABS® is not affiliated with, endorsed by, or sponsored by any of the aforementioned companies. Provider logos and names are used for identification purposes only under fair use for technical documentation and integration compatibility information.

© 2025 B2ALABS. All rights reserved.