TurboQuantIndex vs IdMapIndex
Two index types:
TurboQuantIndex: position-indexed. Simpler, slightly faster. Use when your IDs come from index position.
IdMapIndex: supports stable external IDs (uint64) and O(1) deletes. Use when your IDs come from elsewhere (database primary keys) or when you need to delete documents.
For most RAG systems, IdMapIndex is the right default: IDs taken from your application database survive deletes and rebuilds, so external references to them never break.
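```python
from turbovec import IdMapIndex
import numpy as np

# `vectors` (one 1536-dim row per document) and `query` come from your embedding model.
index = IdMapIndex(dim=1536, bit_width=4)  # dim must match the embedding model
index.add_with_ids(vectors, np.array([1001, 1002, 1003], dtype=np.uint64))
scores, ids = index.search(query, k=10)  # returns your IDs, not positions
index.remove(1002)  # O(1) delete by external ID
```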
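To make the position vs. ID distinction concrete, here is a minimal pure-NumPy sketch of one common way an ID map can sit on top of position-indexed storage: a dict from external ID to row, with deletes done by moving the last row into the freed slot. This is an assumption for illustration, not turbovec’s actual internals. The class `IdMappedStore` is hypothetical, and it stores raw float32 vectors scored by exact inner product rather than turbovec’s compressed codes.

```python
import numpy as np


class IdMappedStore:
    """Hypothetical sketch: stable uint64 IDs over position-indexed rows.

    Not turbovec's implementation. Rows are raw float32 vectors scored by
    exact inner product; turbovec stores compressed codes instead.
    """

    def __init__(self, dim: int, capacity: int = 16):
        self.vectors = np.empty((capacity, dim), dtype=np.float32)  # row -> vector
        self.ids = np.empty(capacity, dtype=np.uint64)              # row -> external ID
        self.slot = {}                                              # external ID -> row
        self.size = 0

    def _grow(self, needed: int) -> None:
        # Double capacity when full, copying the live rows into new arrays.
        if needed <= len(self.ids):
            return
        new_cap = max(needed, 2 * len(self.ids))
        vectors = np.empty((new_cap, self.vectors.shape[1]), dtype=np.float32)
        ids = np.empty(new_cap, dtype=np.uint64)
        vectors[: self.size] = self.vectors[: self.size]
        ids[: self.size] = self.ids[: self.size]
        self.vectors, self.ids = vectors, ids

    def add_with_ids(self, vecs, ext_ids) -> None:
        vecs = np.asarray(vecs, dtype=np.float32)
        ext_ids = np.asarray(ext_ids, dtype=np.uint64)
        self._grow(self.size + len(ext_ids))
        for vec, ext_id in zip(vecs, ext_ids):
            key = int(ext_id)
            if key in self.slot:
                raise KeyError(f"duplicate id {key}")
            self.vectors[self.size] = vec
            self.ids[self.size] = ext_id
            self.slot[key] = self.size
            self.size += 1

    def remove(self, ext_id) -> None:
        # O(1) with respect to the number of stored vectors: move the last row
        # into the freed slot and fix up that row's ID mapping.
        row = self.slot.pop(int(ext_id))  # raises KeyError for unknown IDs
        last = self.size - 1
        if row != last:
            self.vectors[row] = self.vectors[last]
            self.ids[row] = self.ids[last]
            self.slot[int(self.ids[row])] = row
        self.size -= 1

    def search(self, query, k: int):
        # Exhaustive inner-product scan over live rows; returns external IDs.
        scores = self.vectors[: self.size] @ np.asarray(query, dtype=np.float32)
        top = np.argsort(-scores)[: min(k, self.size)]
        return scores[top], self.ids[top]


store = IdMappedStore(dim=4)
store.add_with_ids(np.eye(4)[:3], np.array([1001, 1002, 1003], dtype=np.uint64))
store.remove(1002)  # the row holding 1003 moves into the slot 1002 left behind
scores, ids = store.search(np.array([0, 0, 1, 0], dtype=np.float32), k=2)
print(ids.tolist())  # [1003, 1001]
```

In a design like this, deleting 1002 moves document 1003 from row 2 to row 1. Its position changes but its external ID does not, which is why position-based IDs become unreliable once deletes are allowed.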
Pairing with embedding models
For air-gapped RAG, the embedding model has to run locally — using OpenAI’s API defeats the privacy benefit. Two well-tested combinations:
BGE-M3 + turbovec: BGE-M3 is a strong multilingual embedding model that runs locally. Good default for international knowledge bases.
BGE-large-en-v1.5 + turbovec: English-only, smaller, faster. Good default for English documentation.
If privacy isn’t a hard constraint, OpenAI’s text-embedding-3-small (1536 dim) and text-embedding-3-large (3072 dim) also work cleanly with turbovec. In that setup the leak point is the cloud embedding call, not the index.
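Whichever model you pick, the index’s `dim` has to match the model’s output dimension, and the bit width sets the memory bill. The sketch below is a hypothetical helper (`EMBEDDING_DIMS`, `check_batch` and `packed_bytes` are illustration names, not turbovec API) that validates a batch before it reaches the index and estimates code storage. The size estimate assumes codes are packed at exactly `bit_width` bits per coordinate; turbovec may keep extra per-vector data, so treat the number as a lower bound.

```python
import numpy as np

# Default dense-embedding dimensions of the models named above. The OpenAI
# text-embedding-3 models can also return shortened vectors on request.
EMBEDDING_DIMS = {
    "BAAI/bge-m3": 1024,
    "BAAI/bge-large-en-v1.5": 1024,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def check_batch(embeddings, model: str) -> np.ndarray:
    """Return a contiguous float32 (n, dim) array, or raise if dim doesn't match the model."""
    arr = np.ascontiguousarray(embeddings, dtype=np.float32)
    expected = EMBEDDING_DIMS[model]
    if arr.ndim != 2 or arr.shape[1] != expected:
        raise ValueError(f"{model} produces (n, {expected}) embeddings, got {arr.shape}")
    return arr


def packed_bytes(n_vectors: int, dim: int, bit_width: int) -> int:
    """Lower bound on code storage: bit_width bits per coordinate, rounded up to
    whole bytes per vector, ignoring any per-vector metadata an index may keep."""
    return n_vectors * ((dim * bit_width + 7) // 8)


for model, dim in EMBEDDING_DIMS.items():
    codes_gb = packed_bytes(1_000_000, dim, bit_width=4) / 1e9
    float32_gb = 1_000_000 * dim * 4 / 1e9
    print(f"{model:24s} {codes_gb:5.2f} GB at 4 bits vs {float32_gb:5.2f} GB float32")
```

At 4 bits the codes are 32 / 4 = 8x smaller than float32 regardless of dimension, so under these assumptions 1M text-embedding-3-small vectors drop from about 6.1 GB to roughly 0.77 GB of codes.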
Framework integrations
turbovec ships first-class integrations for the three RAG frameworks that matter most — LangChain, LlamaIndex, and Haystack:
```shell
pip install "turbovec[langchain]"
pip install "turbovec[llama-index]"
pip install "turbovec[haystack]"
```
If your stack is already on one of these, the integration is the path of least resistance.
Honest operational gotchas
A few things to know going in:
Single-machine, in-memory: There is no distributed mode, so horizontal scaling means sharding at your application layer.
Exhaustive scan: turbovec scans every compressed vector on every query. Fast up to ~100M vectors thanks to SIMD; beyond that, you’d want an IVF or graph-based layer on top.
Hardware matters: The 12–20% ARM advantage is measured on Apple M-series chips. On older ARM or non-AVX-512 x86, the gains are smaller (though it still beats FAISS in most configurations).
Low-dim recall gap: This was mentioned earlier but bears repeating. At d ≤ 256, turbovec gives up real recall points to FAISS. The algorithm shines at the dimensions modern embedding models produce (768+).
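The scan and sharding gotchas above can be sketched together: each shard runs an exhaustive scan over everything it holds, and the application merges the per-shard results. This is a hypothetical pure-NumPy illustration (`shard_topk`, `sharded_search` and the four in-process "shards" are made up for the example). Exact float64 inner products stand in for turbovec’s SIMD scan over compressed codes, and in a real deployment each shard would be a separate index on its own machine.

```python
import heapq

import numpy as np


def shard_topk(vectors, ids, query, k):
    """Exhaustive scan of one shard: score every vector, keep the k best."""
    k = min(k, len(ids))
    if k == 0:
        return []
    scores = vectors @ query
    top = np.argpartition(-scores, k - 1)[:k]  # O(n) selection, unordered
    return list(zip(scores[top].tolist(), ids[top].tolist()))


def sharded_search(shards, query, k):
    """Fan the query out to every shard and merge the per-shard top-k lists."""
    candidates = []
    for vectors, ids in shards:
        candidates.extend(shard_topk(vectors, ids, query, k))
    best = heapq.nlargest(k, candidates, key=lambda pair: pair[0])
    return [score for score, _ in best], [doc_id for _, doc_id in best]


rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 64))
doc_ids = np.arange(1000, dtype=np.uint64)
# Four hypothetical "machines", each holding every fourth document.
shards = [(data[i::4], doc_ids[i::4]) for i in range(4)]
query = rng.standard_normal(64)
scores, ids = sharded_search(shards, query, k=10)
```

Merging exact per-shard top-k lists gives the same answer as one global top-k, because any vector in the global top k must also be in the top k of its own shard. The per-query cost is still linear in the total number of vectors, only split across machines, which is why the scan gotcha points to IVF or graph-based layers beyond ~100M vectors.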
The reframe
turbovec sits in a specific sweet spot for RAG: in-memory vector search, mid-scale corpus, owned infrastructure, dimensions that modern embedding models produce.
It’s not trying to replace Qdrant or Pinecone; it’s the lower-level building block that fits where you’d otherwise reach for FAISS, with better compression and no training pass.
The deeper story is worth pausing on. A four-author Google Research paper got accepted at ICLR 2026.
The authors at Google didn’t ship code. Two community developers built independent implementations: one for KV-cache compression, one for vector search.
The vector-search one, turbovec, beats FAISS (the production gold standard) on the project’s own benchmarks.
That pattern is going to keep happening as research output accelerates faster than vendors can productize it. turbovec is one worked example.
Knowing how it works, and when to use it, is useful for any developer or architect building RAG today.
Sources & further reading
turbovec on GitHub
Google Research blog on TurboQuant
FAISS: Fast accumulation of PQ codes (FastScan)
hackimov/turboquant-kv