RAG improves LLM responses by grounding them in external data rather than only in what the model memorized during training. We benchmarked the components a RAG system is built from and gathered the results in one place, with a practical guide to choosing each part of the stack.
See our benchmark results for each RAG component, our guide to choosing a RAG stack, or the RAG fundamentals: what it is, how it works, and where it fits.
RAG benchmark results
Embedding models
The embedding model converts both your documents and the user’s query into vectors, so it sets the ceiling on retrieval quality.
We benchmarked 15 dense embedding models plus a BM25 lexical baseline across three domains (legal contracts/CUAD, customer support/TechQA, and healthcare/MedRAG), scoring each on nDCG@3.
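For readers unfamiliar with the metric, here is a minimal sketch of nDCG@k using the common linear-gain formulation. The benchmark’s exact variant is not stated here, so treat the gain function, the helper names (`dcg_at_k`, `ndcg_at_k`), and the sample labels as assumptions for illustration only.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (ranks are 1-based)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k=3):
    """DCG of the returned order divided by the DCG of the best possible order.

    Simplification: the ideal order is built from the labels of the returned
    list only, not from every relevant document in the corpus.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance labels for one query's ranked results (1 = relevant).
print(ndcg_at_k([1, 1, 0], k=3))  # relevant results already on top
print(ndcg_at_k([1, 0, 1], k=3))  # one relevant result pushed down to rank 3
```

A score of 1.0 means the relevant passages already sit at the top of the list; anything lower means a relevant passage was ranked below an irrelevant one.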
voyage-3.5 ranks first at 0.9429 and beats Voyage’s own voyage-4-large flagship while costing half as much ($0.060 vs $0.120 per 1M tokens). The newest, largest model is not automatically the best buy. For cost-first stacks, Perplexity’s pplx-embed-v1-0.6b delivers about 91% of voyage-3.5’s quality (0.8604) at roughly one-fifteenth of the price ($0.004/1M). For the accuracy-against-price view, see the cost chart in the full embedding models benchmark, which also has the per-domain breakdown and methodology.
Beyond single-vector dense embeddings, late-interaction (multi-vector) retrievers such as ColBERT (and ColPali/ColQwen for visual-document and PDF retrieval) keep one vector per token for finer-grained matching and stronger out-of-domain generalization, at the cost of a much larger index (ColPali stores roughly 1,000× more vectors per item; see our multimodal embedding benchmark).
If your corpus is multilingual or visual, the embedding choice changes: our multilingual embedding benchmark found a 110M-parameter model (e5_base) led all six languages and beat models up to 70× larger, and our multimodal benchmark put Apple’s DFN5B-H on top at 50.1% text-to-image Recall@1. For teams that cannot send data to an API, our open-source embedding benchmark ranks NVIDIA’s Nemotron-8B first (0.9249 nDCG@3), with Microsoft’s MIT-licensed 0.6B Harrier-oss the strongest unrestricted-commercial option.
Reranking
A bi-encoder retriever is fast but approximate. A reranker is a cross-encoder that re-scores the top candidates the retriever returned, reading each query–document pair together to push the truly relevant chunks to the top before they reach the LLM. The canonical 2026 pipeline is to retrieve a wide set, rerank it down, then send 3–5 chunks to the model.1
We benchmarked 8 rerankers on English retrieval (top-100 candidates, 300 queries).
Adding a reranker lifted top-1 accuracy (Hit@1) from 62.67% to 83.00%, a 20.33-point jump from a single extra stage. The result that should change a buying decision: a 149M-parameter model (gte-reranker-modernbert-base) matched a 1.2B model at the top, so the biggest reranker is not the one to reach for. The full reranker benchmark covers latency and the Hit@10 ceiling.
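The retrieve-then-rerank step and the Hit@1 metric can be sketched as follows. This is an illustration under stated assumptions, not the benchmark harness: `overlap_score` is a toy word-overlap function standing in for a real cross-encoder, and the query and documents are invented.

```python
from typing import Callable, Sequence

def rerank(query: str, candidates: Sequence[str],
           score_pair: Callable[[str, str], float], top_n: int = 5) -> list:
    """Re-score every (query, candidate) pair together and keep the best top_n."""
    ordered = sorted(candidates, key=lambda doc: score_pair(query, doc), reverse=True)
    return ordered[:top_n]

def hit_at_k(ranked: Sequence[str], relevant: set, k: int = 1) -> bool:
    """True if any relevant document appears in the first k positions."""
    return any(doc in relevant for doc in ranked[:k])

def overlap_score(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words found in the doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0

# Hypothetical first-stage output, in the order the retriever returned it.
query = "reset router password"
first_stage = [
    "router firmware release notes",
    "how to reset the router admin password",
    "password policy for employee laptops",
]
relevant = {"how to reset the router admin password"}

reranked = rerank(query, first_stage, overlap_score, top_n=2)
print(hit_at_k(first_stage, relevant, k=1))  # False: the retriever put it second
print(hit_at_k(reranked, relevant, k=1))     # True: the reranker moved it to the top
```

Averaging `hit_at_k` over a query set gives the Hit@1 figure; the reranker can only promote documents the first stage already returned, which is why the retrieval width matters.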
Vector databases
The vector database stores your embeddings and serves the nearest-neighbor search at query time, so it sets the latency floor and a large share of the running cost. We benchmarked six managed services on a 1-million-vector, 768-dimension dataset, measuring average query latency and monthly cost.
There is no single winner, only a latency/cost frontier. Zilliz Cloud was fastest (26 ms) and Qdrant close behind (39 ms), while Pinecone was the cheapest at $60/100GB but also the slowest (102 ms), and MongoDB Atlas was the most expensive by a wide margin ($1,440/100GB). All six now support native hybrid search (dense vectors plus BM25 keyword matching), with Reciprocal Rank Fusion (RRF) the default way to merge the two result lists. The full vector database benchmark includes the hybrid-support matrix and a storage calculator.
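Reciprocal Rank Fusion itself is only a few lines. The sketch below uses the constant k = 60 that is commonly used as the default; the document IDs and the two input rankings are hypothetical, and a managed service may apply its own weighting on top.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists from a dense (vector) search and a BM25 search.
dense = ["d3", "d1", "d7"]
bm25 = ["d1", "d9", "d3"]

print(reciprocal_rank_fusion([dense, bm25]))  # ['d1', 'd3', 'd9', 'd7']
```

Because RRF works on ranks rather than raw scores, it needs no score normalization between the dense and lexical lists, which is a large part of why it became the default.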
How to choose your RAG stack
The benchmarks above answer “Which component is best in isolation?” This section answers “How do I assemble them?” Walk the pipeline in order and pick each stage by use case, scale, and budget:
- Chunking: split documents into ~300–500 token passages with 10–20% overlap; prefer semantic/structure-aware splitting over fixed sizes for heterogeneous documents.
- Embedding model: voyage-3.5 for best quality-per-dollar on an API; qwen3-embedding-8b or NVIDIA Nemotron-8B if you must self-host; pick a multilingual or multimodal model if your corpus needs it.
- Vector database: Zilliz/Qdrant when latency dominates; Pinecone or Elasticsearch when cost dominates; any of the six if you need native hybrid search.
- Hybrid retrieval: combine dense + BM25 with RRF; it is the 2026 default because lexical and semantic retrieval fail on different queries, so fusing them is more reliable than either alone.
- Reranking: add a cross-encoder (a 149M model is enough) to recover the ~20 points of top-1 accuracy a bi-encoder leaves on the table.
- Generation: use a model with grounded-citation support, so answers are source-attributable.
- Evaluation: wire in retrieval, generation, and end-to-end metrics before you ship.
Enterprise governance
For enterprise deployments, retrieval quality is necessary but not sufficient; the retrieval layer also has to be governed. Production RAG is expected to enforce permission-aware retrieval (results respect the source system’s access controls, so a user never retrieves a document they could not open directly), sync with identity providers (Okta, Azure AD, Auth0) so permission changes propagate in near real time, log every retrieval for audit, run input/output guardrails, and honor data-residency constraints. Treat these as table stakes, not add-ons, for any RAG system touching internal data.2
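A minimal sketch of permission-aware retrieval with an audit trail, assuming each chunk carries the access groups copied from its source system. The `Chunk` fields, group names, and log format are hypothetical; production systems often push this filter into the vector database query as a metadata filter rather than filtering afterwards.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_groups: frozenset  # groups copied from the source system's ACL (assumed)

def permitted(chunks, user_groups):
    """Keep only chunks that share at least one group with the user."""
    user_groups = set(user_groups)
    return [c for c in chunks if c.allowed_groups & user_groups]

def retrieve_with_audit(chunks, user, user_groups, audit_log):
    """Filter by permission, then record what was returned and to whom."""
    visible = permitted(chunks, user_groups)
    audit_log.append({"user": user, "returned": [c.doc_id for c in visible]})
    return visible

# Hypothetical retrieved candidates and user.
candidates = [
    Chunk("hr-1", "Salary bands for 2026", frozenset({"hr"})),
    Chunk("kb-7", "VPN setup guide", frozenset({"hr", "engineering"})),
]
audit_log = []
visible = retrieve_with_audit(candidates, "dana", {"engineering"}, audit_log)
print([c.doc_id for c in visible])  # ['kb-7']
print(audit_log)
```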
RAG vs. long context
With context windows reaching millions of tokens, a fair question is whether RAG is still necessary. In 2026, the answer is not either/or: RAG retrieves the relevant evidence, a long context window lets the model reason over it, and a routing layer decides which path each query takes.
The decision usually comes down to cost. Because an LLM bills for every input token on every request, stuffing a full corpus into context is expensive at scale. For large knowledge bases under steady query load, RAG can run on the order of 1,250× cheaper per query than long-context stuffing, since it pays for a few thousand retrieved tokens instead of the whole archive each time.3
That advantage is conditional. RAG wins on cost above roughly 500K tokens of corpus and a few thousand queries a day. Below ~200K tokens and a few hundred queries a day, long context with prompt caching often wins outright, because the vector database’s fixed hosting cost alone can exceed the entire long-context bill.4 Accuracy still favors retrieval for needle-in-haystack lookups, where filtering out irrelevant text reduces the “lost in the middle” attention drift that degrades long-context recall.
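The break-even logic can be made concrete with a back-of-the-envelope model. Every number below is an assumption chosen for illustration (token price, cache discount, hosting fee, corpus sizes, query volumes), not a figure from the cited analyses, and the model ignores output tokens and embedding costs.

```python
def long_context_cost(queries, corpus_tokens, price_per_mtok, cache_discount=1.0):
    """Daily cost when the whole corpus is sent as input on every query."""
    return queries * corpus_tokens / 1_000_000 * price_per_mtok * cache_discount

def rag_cost(queries, retrieved_tokens, price_per_mtok, fixed_hosting):
    """Daily cost when only retrieved chunks are sent, plus vector DB hosting."""
    return queries * retrieved_tokens / 1_000_000 * price_per_mtok + fixed_hosting

# All values are hypothetical.
PRICE = 1.00       # USD per 1M input tokens
CACHE = 0.10       # cached input billed at 10% of the normal price
HOSTING = 2.00     # USD per day for a managed vector database
RETRIEVED = 4_000  # tokens of retrieved context per query

# Large corpus, heavy load: 1M tokens, 5,000 queries a day.
large_lc = long_context_cost(5_000, 1_000_000, PRICE, CACHE)
large_rag = rag_cost(5_000, RETRIEVED, PRICE, HOSTING)

# Small corpus, light load: 100K tokens, 200 queries a day.
small_lc = long_context_cost(200, 100_000, PRICE, CACHE)
small_rag = rag_cost(200, RETRIEVED, PRICE, HOSTING)

print(f"large: long context ${large_lc:.2f} vs RAG ${large_rag:.2f}")
print(f"small: long context ${small_lc:.2f} vs RAG ${small_rag:.2f}")
```

Under these assumptions RAG is far cheaper in the large scenario, while in the small one the fixed hosting fee makes cached long context the cheaper path. The crossover moves with real prices, so it is worth rerunning with your own numbers.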
What are the available RAG models and tools?
RAG tooling falls into three groups: LLMs and APIs with built-in grounding, orchestration frameworks, and the underlying retrieval components (embedding models, vector databases, rerankers).
LLMs and APIs with built-in grounding
Several model providers now ship grounded-generation features so you can attach external knowledge with source attribution:
- Anthropic Claude: a Citations API that grounds answers in the documents you supply and returns references to the exact passages used.5
- Google Gemini: a built-in File Search tool that handles RAG for you (upload documents and Gemini chunks, embeds, and retrieves them at query time), plus Vertex AI RAG Engine for managed enterprise retrieval. Its separate “grounding with Google Search” feature pulls from the live web, not your own data.6
- Cohere Command: RAG-tuned models (Command R/R+ and the newer Command A) that return inline citations out of the box, paired with a dedicated Rerank endpoint.7
- OpenAI: a file-search retrieval tool in the Assistants and Responses APIs.8
RAG libraries and frameworks
These wire retrieval and generation into a pipeline:
- LangChain / LangGraph: general-purpose orchestration; LangGraph adds stateful, agentic retrieve-reflect-verify loops.
- LlamaIndex: data ingestion, indexing, and query engines.
- Haystack: end-to-end pipelines for search and question answering.
- DSPy: declarative, optimizer-driven prompt/retrieval programs.
For a deeper comparison, see our RAG frameworks analysis.
What is retrieval-augmented generation?
Retrieval-augmented generation is a technique that gives a large language model access to an external knowledge source at query time. Instead of answering only from parameters fixed during training, the model retrieves relevant passages from a document store and conditions its response on them. This keeps answers current, grounds them in citable sources, and reduces hallucination on knowledge-intensive tasks, without retraining the model.
How do RAG models work?
At its core, RAG runs in two phases: retrieval (find the passages relevant to the query) and generation (write an answer conditioned on those passages). In production systems, that core loop is wrapped in a fuller pipeline:
- Query rewriting/decomposition: rephrase or split the question to retrieve better, especially for multi-turn or multi-hop queries.
- Hybrid retrieval: run dense (vector) and sparse (BM25) searches and fuse the results with RRF.
- Reranking: a cross-encoder re-scores the candidates and keeps the top few.
- Context assembly: build the prompt from the selected chunks with citations.
- Generation: the LLM answers from the assembled context.
- Evaluation: score retrieval and answer quality, ideally in CI.
The two-phase loop is still the mental model; the extra stages are what separate a demo from a production system.
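As one concrete piece of that pipeline, the context-assembly stage can be sketched as a function that numbers each selected chunk so the model can cite it. The prompt wording, the `assemble_context` name, and the sample chunks are assumptions, not a prescribed format.

```python
def assemble_context(question, chunks):
    """Build a prompt from (source, text) pairs, numbering each chunk for citation."""
    lines = ["Answer using only the sources below and cite them as [n].", ""]
    for i, (source, text) in enumerate(chunks, start=1):
        lines.append(f"[{i}] ({source}) {text}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)

# Hypothetical chunks that survived retrieval and reranking.
selected = [
    ("handbook.pdf", "Employees accrue 1.5 vacation days per month."),
    ("policy-2026.md", "Unused vacation days expire on March 31."),
]
prompt = assemble_context("How many vacation days do I accrue per month?", selected)
print(prompt)
```

Keeping the source label next to each numbered chunk is what lets the generation stage return answers that can be traced back to a document.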
What are the different types of RAG?
Beyond the linear pipeline, several RAG variants target specific failure modes: Speculative RAG (draft-and-verify for speed); Retrieval-Augmented Fine-Tuning (RAFT), which trains the model to use retrieved context; and Self-RAG and Corrective RAG (CRAG), in which the model critiques its evidence and re-retrieves when that evidence is weak. These overlap with the advanced architectures below.
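The critique-and-re-retrieve idea can be sketched as a small control loop. This is a simplified illustration of the pattern, not the published Self-RAG or CRAG algorithms: `retrieve`, `grade`, and `rewrite` are hypothetical callables you would back with a retriever and an LLM, and the threshold is arbitrary.

```python
def corrective_retrieve(query, retrieve, grade, rewrite, threshold=0.5, max_rounds=3):
    """Retrieve, grade the evidence, and rewrite the query until it is good enough."""
    best_docs, best_score = [], float("-inf")
    for _ in range(max_rounds):
        docs = retrieve(query)
        score = grade(query, docs)
        if score > best_score:
            best_docs, best_score = docs, score
        if score >= threshold:
            break                  # evidence judged strong enough, stop early
        query = rewrite(query)     # evidence weak: reformulate and try again
    return best_docs, best_score

# Toy stand-ins so the loop can run without a model (all hypothetical).
INDEX = {"vpn": [], "vpn setup guide": ["kb-7: VPN setup guide"]}

def toy_retrieve(q):
    return INDEX.get(q, [])

def toy_grade(q, docs):
    return 1.0 if docs else 0.0

def toy_rewrite(q):
    return q + " setup guide"

docs, score = corrective_retrieve("vpn", toy_retrieve, toy_grade, toy_rewrite)
print(docs, score)
```

The `max_rounds` cap matters in practice: each extra round adds retrieval and grading latency, so the loop should give up and return its best evidence rather than retry indefinitely.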
Advanced RAG architectures
Graph-based RAG (GraphRAG)
GraphRAG builds a knowledge graph over the corpus, often on a dedicated graph database such as Neo4j or FalkorDB, so the system can answer multi-hop and global-aggregation questions that flat vector search misses. Its edge on those questions comes largely from pre-computing relationships across the whole corpus rather than from better passage retrieval, so vector search still tends to win on specific-document lookups. The practical takeaway: reach for a graph when queries require global reasoning across many documents, not as a drop-in replacement for vector retrieval.
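To show why a graph helps with multi-hop questions, here is a toy breadth-first traversal over a hand-built adjacency map. It illustrates multi-hop lookup only and does not reproduce any particular GraphRAG implementation (which may also rely on pre-computed summaries); the entities, relations, and `find_path` helper are invented.

```python
from collections import deque

def find_path(graph, start, goal, max_hops=3):
    """Breadth-first search for the shortest entity path within max_hops edges."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        if len(path) - 1 >= max_hops:
            continue  # this path already used its hop budget
        for _relation, neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

# Hypothetical knowledge graph: entity -> list of (relation, entity) edges.
graph = {
    "Acme": [("acquired", "Beta")],
    "Beta": [("supplies", "Gamma")],
}

print(find_path(graph, "Acme", "Gamma"))              # ['Acme', 'Beta', 'Gamma']
print(find_path(graph, "Acme", "Gamma", max_hops=1))  # None
```

No single passage needs to mention both Acme and Gamma for the connection to be found, which is the kind of question flat vector search tends to miss.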
Agentic RAG
Agentic RAG puts an LLM agent in charge of retrieval: deciding what to fetch, which source or tool to call, and when to reflect and retry, looping until the answer is grounded. In our agentic RAG benchmark, which tests an agent that must route each question to the right database and then write SQL against it, the strongest models now route almost perfectly (Claude Opus 4.8 at 100%, Fable 5 at 98%), while writing correct SQL against the chosen schema remains the harder step, topping out around 90%. Routing is close to solved; grounded execution is where agentic RAG systems still differ.
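Scoring the two stages separately is what makes that gap visible. The sketch below assumes a hypothetical per-question record layout (`chosen_db`, `expected_db`, `rows`, `expected_rows`) and is not the benchmark’s actual scoring code.

```python
def score_agent(results):
    """Return (routing accuracy, execution accuracy) over a list of result records.

    A question counts as executed correctly only if it was routed to the right
    database AND the returned rows match, so execution can never exceed routing.
    """
    n = len(results)
    if n == 0:
        return 0.0, 0.0
    routed = sum(r["chosen_db"] == r["expected_db"] for r in results)
    executed = sum(
        r["chosen_db"] == r["expected_db"] and r["rows"] == r["expected_rows"]
        for r in results
    )
    return routed / n, executed / n

# Hypothetical run: every question routed correctly, one SQL result wrong.
results = [
    {"chosen_db": "sales", "expected_db": "sales", "rows": [(42,)], "expected_rows": [(42,)]},
    {"chosen_db": "hr", "expected_db": "hr", "rows": [(7,)], "expected_rows": [(7,)]},
    {"chosen_db": "sales", "expected_db": "sales", "rows": [(3,)], "expected_rows": [(5,)]},
    {"chosen_db": "ops", "expected_db": "ops", "rows": [], "expected_rows": []},
]
print(score_agent(results))  # (1.0, 0.75)
```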
Hybrid, iterative, and active RAG
Hybrid retrieval (dense + sparse, covered above) is now the default rather than an advanced option. Iterative and active variants (e.g., FLARE) let the model retrieve repeatedly as it generates, fetching new evidence when its confidence drops.
How to evaluate RAG systems
RAG evaluation is now organized into three layers: retrieval (precision, recall, MRR, nDCG, hit@k), which asks whether the right chunks were fetched; generation (groundedness, faithfulness), which asks whether the answer is supported by the retrieved context; and end-to-end, which asks whether the final answer is correct.
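The retrieval-layer metrics are simple enough to compute by hand, which helps when sanity-checking a framework’s output. A minimal sketch with standard definitions follows; the document IDs and relevance judgments are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Share of the top-k results that are relevant."""
    return sum(doc in relevant for doc in ranked[:k]) / k

def recall_at_k(ranked, relevant, k):
    """Share of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(doc in relevant for doc in ranked[:k]) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0 if none was retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

# Hypothetical retrieval results for two queries.
run_a = (["d4", "d2", "d9", "d1"], {"d2", "d1"})
run_b = (["d5", "d6", "d8"], {"d5"})

print(precision_at_k(*run_a, k=3))           # 1 of the top 3 is relevant
print(recall_at_k(*run_a, k=3))              # 0.5: one of two relevant docs found
print(mean_reciprocal_rank([run_a, run_b]))  # 0.75: (1/2 + 1/1) / 2
```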
The tooling divides along the same lines: RAGAS for fast, reference-free iteration during development; DeepEval as a pytest-style pass/fail gate in CI so a regression blocks the build; and TruLens or Phoenix for tracing and monitoring in production. TREC-RAG and ARES are useful external references for judge calibration.9
Chunk size
Chunk size controls how documents are split before embedding.
The 2026 guidance has moved beyond a single fixed size: prefer semantic / structure-aware chunking (start a new chunk where adjacent sentences diverge in meaning), keep chunks around 300–500 tokens with 10–20% overlap, and consider contextual retrieval, Anthropic’s technique of prepending an LLM-generated context sentence to each chunk before embedding and BM25 indexing. In Anthropic’s tests, contextual embeddings cut the top-20 retrieval-failure rate by 35%, contextual embeddings plus contextual BM25 by 49%, and adding a reranker on top by 67%.10
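The size-and-overlap part of that guidance can be sketched as a sliding window over tokens. This is a fixed-window baseline, not semantic chunking; the defaults (400 tokens, 60-token overlap, which is 15%) are one assumed point inside the recommended range, and whitespace splitting stands in for a real tokenizer.

```python
def chunk_tokens(tokens, size=400, overlap=60):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

# Whitespace split is a crude stand-in for a model tokenizer (assumption).
text = " ".join(f"word{i}" for i in range(1000))
chunks = chunk_tokens(text.split(), size=400, overlap=60)

print(len(chunks))                          # 3
print(chunks[0][-60:] == chunks[1][:60])    # True: neighbors share 60 tokens
```

The overlap exists so that a sentence cut by a chunk boundary still appears whole in at least one chunk; structure-aware splitting reduces how often that cut happens in the first place.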
Fine-Tuning vs. Retrieval-Augmented Generation
RAG and fine-tuning solve different problems, and in 2026 they are increasingly used together rather than as alternatives.
For most teams, the answer is “RAG first, fine-tune the behavior if needed,” and RAFT formalizes doing both.
Benefits of retrieval-augmented generation
RAG’s advantages cluster into a few that actually drive adoption: accuracy and freshness (answers reflect current, source-grounded data, not a frozen training cutoff), transparency (responses cite the passages they used, so they are auditable), lower cost than long context at scale, and adaptability (update the knowledge base instead of retraining the model). Multimodal RAG extends these to images, PDFs, and tables.
Further reading
- Embedding models benchmark
- Reranker benchmark
- Vector database for RAG
- Open-source embedding models
- Multilingual embedding models
- Multimodal embeddings
- Agentic RAG frameworks
Cite this benchmark
@misc{sar2026,
author = {Sarı, Ekrem},
title = {{Best RAG Tools, Frameworks, and Libraries}},
year = {2026},
month = jun,
howpublished = {\url{https://aimultiple.com/retrieval-augmented-generation}},
note = {AIMultiple. Retrieved June 30, 2026}
}