Consolidate EmbeddingMetadataStorage into VectorDB #68


Open · wants to merge 2 commits into base: master
49 changes: 19 additions & 30 deletions README.md
@@ -8,25 +8,16 @@
</picture>
</p>


<h3 align="center">
Reliable and Efficient Semantic Prompt Caching
</h3>
<br>

Semantic caching reduces LLM latency and cost by returning cached model responses for semantically similar prompts (not just exact matches). **vCache** is the first verified semantic cache that **guarantees user-defined error rate bounds**. vCache replaces static thresholds with **online-learned, embedding-specific decision boundaries**—no manual fine-tuning required. This enables reliable cached response reuse across any embedding model or workload.

> [!NOTE]
> vCache is currently in active development. Features and APIs may change as we continue to improve the system.

## 🚀 Quick Install

Install vCache in editable mode:
@@ -40,6 +31,7 @@ Then, set your OpenAI key:
```bash
export OPENAI_API_KEY="your_api_key_here"
```

(Note: vCache uses OpenAI by default for both LLM inference and embedding generation, but you can configure any other backend.)

Finally, use vCache in your Python code:
@@ -53,16 +45,14 @@
```python
print(f"Response: {response}")
```
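
The collapsed hunk above hides most of this snippet. As a rough end-to-end sketch, assuming `VCache()` applies the defaults listed below and exposes an `infer` entry point (both assumptions, not confirmed by this diff):

```python
from vcache import VCache

# Default stack: OpenAI inference + embeddings, HNSWLib vector DB,
# and the verified decision policy with a 2% error bound.
vcache = VCache()

# Hypothetical entry point: embeds the prompt, checks the cache, and
# either returns a cached response or falls back to the LLM.
response: str = vcache.infer("Is the sky blue?")
print(f"Response: {response}")
```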

By default, vCache uses:

- `OpenAIInferenceEngine`
- `OpenAIEmbeddingEngine`
- `HNSWLibVectorDB`
- `InMemoryEmbeddingMetadataStorage`
- `NoEvictionPolicy`
- `StringComparisonSimilarityEvaluator`
- `VerifiedDecisionPolicy` with a maximum failure rate of 2%

## ⚙️ Advanced Configuration

vCache is modular and highly configurable. Below is an example showing how to customize key components:
@@ -75,12 +65,12 @@
<details>

```python
from vcache.main import VCache
from vcache.config import VCacheConfig
from vcache.inference_engine.strategies.open_ai import OpenAIInferenceEngine
from vcache.vcache_core.cache.embedding_engine.strategies.open_ai import OpenAIEmbeddingEngine
from vcache.vcache_core.cache.embedding_store.embedding_metadata_storage.strategies.in_memory import InMemoryEmbeddingMetadataStorage
from vcache.vcache_core.similarity_evaluator.strategies.string_comparison import StringComparisonSimilarityEvaluator
from vcache.vcache_policy.strategies.dynamic_local_threshold import VerifiedDecisionPolicy
from vcache.vcache_policy.vcache_policy import VCachePolicy
from vcache.vcache_core.cache.embedding_store.vector_db import HNSWLibVectorDB, SimilarityMetricType
from vcache.vcache_core.cache.vector_db import HNSWLibVectorDB, SimilarityMetricType
```

</details>

```python
@@ -92,7 +82,6 @@
vcache_config: VCacheConfig = VCacheConfig(
similarity_metric_type=SimilarityMetricType.COSINE,
max_capacity=100_000,
),
embedding_metadata_storage=InMemoryEmbeddingMetadataStorage(),
similarity_evaluator=StringComparisonSimilarityEvaluator,
)
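
# --- Sketch: completing the collapsed example above ---
# Assumed from this PR's tests (test.py, benchmarks/benchmark.py):
# VCache(config, policy) and VerifiedDecisionPolicy(delta=<bound>).
vcache_policy = VerifiedDecisionPolicy(delta=0.02)  # 2% maximum failure rate
vcache = VCache(vcache_config, vcache_policy)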
```

@@ -117,16 +106,16 @@ Semantic caching reduces LLM latency and cost by returning cached model response
### Architecture Overview

1. **Embed & Store**
   Each prompt is converted to a fixed-length vector (an “embedding”) and stored in a vector database along with its LLM response.

2. **Nearest-Neighbor Lookup**
   When a new prompt arrives, the cache embeds it and finds its most similar stored prompt using a similarity metric (e.g., cosine similarity).

3. **Similarity Score**
   The system computes a score between 0 and 1 that quantifies how “close” the new prompt is to the retrieved entry.

4. **Decision: Exploit vs. Explore**
   - **Exploit (cache hit):** If the similarity is above a confidence bound, return the cached response.
   - **Explore (cache miss):** Otherwise, query the LLM for a response, add its embedding and answer to the cache, and return it.
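
Taken together, the four steps form a small decision loop. The sketch below is illustrative only; `embed`, `nearest`, and `confidence_bound` are placeholder callables, not vCache APIs:

```python
def cached_inference(prompt, cache, embed, llm, confidence_bound):
    """Illustrative exploit/explore loop; not the actual vCache implementation."""
    query_vec = embed(prompt)                  # 1. embed the incoming prompt
    hit = cache.nearest(query_vec)             # 2. nearest-neighbor lookup
    if hit is not None:
        score = hit.similarity                 # 3. similarity score in [0, 1]
        if score >= confidence_bound(hit):     # 4. exploit: confident cache hit
            return hit.response
    response = llm(prompt)                     # 4. explore: query the model,
    cache.add(query_vec, response)             #    then grow the cache
    return response
```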

<p align="left">
@@ -139,6 +128,7 @@ The system computes a score between 0 and 1 that quantifies how “close” the
</p>

### Why Fixed Thresholds Fall Short

Existing semantic caches rely on a **global static threshold** to decide whether to reuse a cached response (exploit) or invoke the LLM (explore). If the similarity score exceeds this threshold, the cache reuses the response; otherwise, it invokes the model. This strategy is fundamentally limited.
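
Concretely, the static policy reduces to a single global comparison. A minimal sketch (the `0.8` mirrors the `BenchmarkStaticDecisionPolicy(threshold=0.8)` used in this PR's tests), with its limitations listed below:

```python
STATIC_THRESHOLD = 0.8  # one value for every prompt, workload, and embedding model

def static_cache_hit(similarity_score: float) -> bool:
    # Exploit (reuse the cached response) iff the score clears the fixed bar.
    return similarity_score >= STATIC_THRESHOLD
```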

- **Uniform threshold, diverse prompts:** A fixed threshold assumes all embeddings are equally distributed—ignoring that similarity is context-dependent.
@@ -161,11 +151,11 @@ vCache overcomes these limitations with two ideas:
### Benefits

- **Reliability**
  Formally bounds the rate of incorrect cache hits to your chosen tolerance.
- **Performance**
  Matches or exceeds static-threshold systems in cache hit rate and end-to-end latency.
- **Simplicity**
  Plug in any embedding model; vCache learns and adapts automatically at runtime.

<p align="left">
<picture>
@@ -178,25 +168,24 @@

Please refer to the [vCache paper](https://arxiv.org/abs/2502.03771) for further details.


## 🛠 Developer Guide

For advanced usage and development setup, see the [Developer Guide](ReadMe_Dev.md).

## 📊 Benchmarking vCache

vCache includes a benchmarking framework to evaluate:

- **Cache hit rate**
- **Error rate**
- **Latency improvement**
- **...**

We provide three open benchmarks:
- **SemCacheLmArena** (chat-style prompts) - [Dataset ↗](https://huggingface.co/datasets/vCache/SemBenchmarkLmArena)
- **SemCacheClassification** (classification queries) - [Dataset ↗](https://huggingface.co/datasets/vCache/SemBenchmarkClassification)
- **SemCacheSearchQueries** (real-world search logs) - [Dataset ↗](https://huggingface.co/datasets/vCache/SemBenchmarkSearchQueries)
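
To peek at one of these datasets, a minimal sketch using the Hugging Face `datasets` library (the `train` split name is an assumption):

```python
from datasets import load_dataset

# Dataset IDs are taken from the links above; the split name is assumed.
benchmark = load_dataset("vCache/SemBenchmarkLmArena", split="train")
print(benchmark[0])  # inspect one chat-style prompt record
```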

See the [Benchmarking Documentation](benchmarks/ReadMe.md) for instructions.

@@ -211,4 +200,4 @@ If you use vCache for your research, please cite our [paper](https://arxiv.org/a
journal={arXiv preprint arXiv:2502.03771},
year={2025}
}
```
14 changes: 5 additions & 9 deletions benchmarks/benchmark.py
@@ -23,16 +23,13 @@
from vcache.vcache_core.cache.embedding_engine.strategies.benchmark import (
BenchmarkEmbeddingEngine,
)
from vcache.vcache_core.cache.embedding_store.embedding_metadata_storage import (
InMemoryEmbeddingMetadataStorage,
)
from vcache.vcache_core.cache.embedding_store.embedding_metadata_storage.embedding_metadata_obj import (
EmbeddingMetadataObj,
)
from vcache.vcache_core.cache.embedding_store.vector_db import (
from vcache.vcache_core.cache.vector_db import (
HNSWLibVectorDB,
SimilarityMetricType,
)
from vcache.vcache_core.cache.vector_db.embedding_metadata_obj import (
EmbeddingMetadataObj,
)
from vcache.vcache_core.similarity_evaluator import SimilarityEvaluator
from vcache.vcache_core.similarity_evaluator.strategies.llm_comparison import (
LLMComparisonSimilarityEvaluator,
@@ -396,7 +393,7 @@ def dump_results_to_json(self):
var_ts_dict = {}

metadata_objects: List[EmbeddingMetadataObj] = (
self.vcache.vcache_config.embedding_metadata_storage.get_all_embedding_metadata_objects()
self.vcache.vcache_config.vector_db.get_all_embedding_metadata_objects()
)

for metadata_object in metadata_objects:
@@ -486,7 +483,6 @@ def __run_baseline(
similarity_metric_type=SimilarityMetricType.COSINE,
max_capacity=MAX_VECTOR_DB_CAPACITY,
),
embedding_metadata_storage=InMemoryEmbeddingMetadataStorage(),
similarity_evaluator=similarity_evaluator,
)
vcache: VCache = VCache(vcache_config, vcache_policy)
6 changes: 1 addition & 5 deletions test.py
@@ -4,10 +4,7 @@
from vcache.vcache_core.cache.embedding_engine.strategies.open_ai import (
OpenAIEmbeddingEngine,
)
from vcache.vcache_core.cache.embedding_store.embedding_metadata_storage.strategies.in_memory import (
InMemoryEmbeddingMetadataStorage,
)
from vcache.vcache_core.cache.embedding_store.vector_db import (
from vcache.vcache_core.cache.vector_db import (
HNSWLibVectorDB,
SimilarityMetricType,
)
@@ -27,7 +24,6 @@
similarity_metric_type=SimilarityMetricType.COSINE,
max_capacity=100000,
),
embedding_metadata_storage=InMemoryEmbeddingMetadataStorage(),
similarity_evaluator=StringComparisonSimilarityEvaluator,
)
vcache: VCache = VCache(vcache_config, vcache_policy)
6 changes: 3 additions & 3 deletions tests/integration/test_concurrency.py
@@ -8,7 +8,6 @@

from vcache import (
HNSWLibVectorDB,
InMemoryEmbeddingMetadataStorage,
LangChainEmbeddingEngine,
StringComparisonSimilarityEvaluator,
VCache,
@@ -46,7 +45,6 @@ def answers_similar(a, b):
model_name="sentence-transformers/all-mpnet-base-v2"
),
vector_db=HNSWLibVectorDB(),
embedding_metadata_storage=InMemoryEmbeddingMetadataStorage(),
similarity_evaluator=similarity_evaluator,
)

@@ -93,7 +91,9 @@ def do_inference(prompt):
time.sleep(1.5)
executor.map(do_inference, concurrent_prompts_chunk_2)

all_metadata_objects = vcache.vcache_config.embedding_metadata_storage.get_all_embedding_metadata_objects()
all_metadata_objects = (
vcache.vcache_config.vector_db.get_all_embedding_metadata_objects()
)
final_observation_count = len(all_metadata_objects)

for i, metadata_object in enumerate(all_metadata_objects):
2 changes: 0 additions & 2 deletions tests/integration/test_dynamic_threshold.py
@@ -4,7 +4,6 @@

from vcache import (
HNSWLibVectorDB,
InMemoryEmbeddingMetadataStorage,
LangChainEmbeddingEngine,
OpenAIInferenceEngine,
VCache,
@@ -25,7 +24,6 @@ def create_default_config_and_policy():
model_name="sentence-transformers/all-mpnet-base-v2"
),
vector_db=HNSWLibVectorDB(),
embedding_metadata_storage=InMemoryEmbeddingMetadataStorage(),
system_prompt="Please answer in a single word with the first letter capitalized. Example: London",
)
policy = VerifiedDecisionPolicy(delta=0.05)
2 changes: 0 additions & 2 deletions tests/integration/test_static_threshold.py
@@ -5,7 +5,6 @@
from vcache import (
BenchmarkStaticDecisionPolicy,
HNSWLibVectorDB,
InMemoryEmbeddingMetadataStorage,
LangChainEmbeddingEngine,
OpenAIInferenceEngine,
VCache,
@@ -25,7 +24,6 @@ def create_default_config_and_policy():
model_name="sentence-transformers/all-mpnet-base-v2"
),
vector_db=HNSWLibVectorDB(),
embedding_metadata_storage=InMemoryEmbeddingMetadataStorage(),
)
policy = BenchmarkStaticDecisionPolicy(threshold=0.8)
return config, policy
32 changes: 0 additions & 32 deletions tests/unit/EmbeddingMetadataStrategy/test_embedding_metadata.py

This file was deleted.

6 changes: 2 additions & 4 deletions tests/unit/VCachePolicyStrategy/test_vcache_policy.py
@@ -3,7 +3,7 @@
from unittest.mock import MagicMock, patch

from vcache.config import VCacheConfig
from vcache.vcache_core.cache.embedding_store.embedding_metadata_storage.embedding_metadata_obj import (
from vcache.vcache_core.cache.vector_db import (
EmbeddingMetadataObj,
)
from vcache.vcache_policy.strategies.verified import (
@@ -48,11 +48,9 @@ def update_metadata(embedding_id, embedding_metadata):
mock_config = MagicMock(spec=VCacheConfig)
mock_config.inference_engine = self.mock_inference_engine
mock_config.similarity_evaluator = self.mock_similarity_evaluator
# Add all required attributes for Cache creation
mock_config.embedding_engine = MagicMock()
mock_config.embedding_metadata_storage = MagicMock()
mock_config.vector_db = MagicMock()
mock_config.eviction_policy = MagicMock()
mock_config.embedding_engine = MagicMock()

self.policy = VerifiedDecisionPolicy()
self.policy.setup(mock_config)