Best RAG Tools, Frameworks, and Libraries

updated on Jul 18, 2026

RAG improves LLM responses by grounding them in external data instead of just what the model memorized in training. We benchmarked the components a RAG system is built from and gathered the results in one place, with a practical guide to choosing each part of the stack.

See our benchmark results for each RAG component, our guide to choosing a RAG stack, or the RAG fundamentals: what it is, how it works, and where it fits.

RAG benchmark results

Embedding models

The embedding model converts both your documents and the user’s query into vectors, so it sets the ceiling on retrieval quality.

We benchmarked 15 dense embedding models plus a BM25 lexical baseline across three domains (legal contracts/CUAD, customer support/TechQA, and healthcare/MedRAG), scoring each on nDCG@3.
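For readers unfamiliar with the metric, here is a minimal sketch of how nDCG@k can be computed for one query. It assumes linear gain (the relevance grade itself, not 2^grade − 1) and takes the ideal ordering from the returned list only; the function names and the sample labels are hypothetical and are not the benchmark's own harness.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded relevances."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=3):
    """nDCG@k: DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded labels for one query, in the order the retriever returned them.
ranked_relevances = [0, 1, 1, 0]
print(round(ndcg_at_k(ranked_relevances, k=3), 4))
```

A score of 1.0 means the most relevant passages already sit in the top k in the best possible order; a benchmark score is typically the mean over all queries.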

voyage-3.5 ranks first at 0.9429 and beats Voyage’s own voyage-4-large flagship while costing half as much ($0.060 vs $0.120 per 1M tokens). The newest, largest model is not automatically the best buy. For cost-first stacks, Perplexity’s pplx-embed-v1-0.6b delivers about 91% of voyage-3.5’s quality (0.8604 vs 0.9429) at roughly one-fifteenth of the price ($0.004/1M). For the accuracy-against-price view, see the cost chart in the full embedding models benchmark, which also has the per-domain breakdown and methodology.

Beyond single-vector dense embeddings, late-interaction (multi-vector) retrievers such as ColBERT (and ColPali/ColQwen for visual-document and PDF retrieval) keep one vector per token, which gives finer matching and stronger out-of-domain generalization at the cost of a much larger index (ColPali stores roughly 1,000× more vectors per item; see our multimodal embedding benchmark).

If your corpus is multilingual or visual, the embedding choice changes: our multilingual embedding benchmark found a 110M-parameter model (e5_base) led all six languages and beat models up to 70× larger, and our multimodal benchmark put Apple’s DFN5B-H on top at 50.1% text-to-image Recall@1. For teams that cannot send data to an API, our open-source embedding benchmark ranks NVIDIA’s Nemotron-8B first (0.9249 nDCG@3), with Microsoft’s MIT-licensed 0.6B Harrier-oss the strongest unrestricted-commercial option.

Reranking

A bi-encoder retriever is fast but approximate. A reranker is a cross-encoder that re-scores the top candidates the retriever returned, reading each query–document pair together to push the truly relevant chunks to the top before they reach the LLM. The canonical 2026 pipeline is to retrieve a wide set, rerank it down, then send 3–5 chunks to the model. ¹

We benchmarked 8 rerankers on English retrieval (top-100 candidates, 300 queries):

Adding a reranker lifted top-1 accuracy (Hit@1) from 62.67% to 83.00%, a 20.33-point jump from a single extra stage. The result that should change a buying decision: a 149M-parameter model (gte-reranker-modernbert-base) matched a 1.2B model at the top, so the biggest reranker is not the one to reach for. The full reranker benchmark covers latency and the Hit@10 ceiling.
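The retrieve-then-rerank stage and the Hit@1 metric can be sketched as follows. The scorer here is a toy term-overlap function standing in for a real cross-encoder, and every name in the block is hypothetical; in practice score_fn would call whichever reranker model you deploy.

```python
def overlap_score(query, document):
    """Toy stand-in for a cross-encoder: share of query terms found in the document."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    return len(query_terms & doc_terms) / len(query_terms) if query_terms else 0.0

def rerank(query, candidates, score_fn=overlap_score, keep=3):
    """Re-score every retrieved candidate against the query and keep the best few."""
    scored = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return scored[:keep]

def hit_at_1(ranked_lists, relevant_docs):
    """Share of queries whose top-ranked document is the labelled relevant one."""
    hits = sum(1 for ranked, gold in zip(ranked_lists, relevant_docs)
               if ranked and ranked[0] == gold)
    return hits / len(relevant_docs)

# Candidates in the (approximate) order a bi-encoder retriever returned them.
candidates = [
    "refund policy for annual plans",
    "how to reset a password",
    "password reset email not arriving",
]
query = "reset password email"
gold = "password reset email not arriving"

reranked = rerank(query, candidates, keep=2)
print(hit_at_1([candidates], [gold]), hit_at_1([reranked], [gold]))
```

In this toy case the relevant document moves from third place to first, which is the effect the Hit@1 lift measures across a full query set.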

Vector databases

The vector database stores your embeddings and serves the nearest-neighbor search at query time, so it sets the latency floor and a large share of the running cost. We benchmarked seven open-source, self-hosted engines on identical bge-m3 embeddings, each read at a matched Recall@10 of 0.95 so the index was the only variable.

The seven tie on retrieval accuracy: nDCG@10 lands between 0.803 and 0.817, a spread of 0.014. Against that sits a roughly 11× spread in single-thread throughput (Redis 764 QPS, LanceDB 70) and a 3.7× spread in peak memory at 2.25M vectors (Milvus 17.0 GB, Chroma 62.4 GB). The engine is a speed, memory, and workload decision rather than an accuracy one, because the embedding model sets the quality ceiling at this operating point.

Which engine fits follows from the workload. Redis recorded 1.7 ms p95 at 559 MB of RAM, with persistence off. Weaviate reached 8,330 QPS at 32 worker processes, where Redis saturated at 1,642. Milvus held 17.0 GB at 2.25M vectors against Chroma’s 62.4 GB, and kept the highest worst-case recall under metadata filters (0.984). Qdrant recorded a +0.067 nDCG hybrid lift with native fusion. Two engines carry hard limits: Chroma ships no keyword search in its self-hosted build and returns a 13-second p99 at 512 concurrent clients, and LanceDB absorbs 2.6 single-row writes per second, which rules it out for a continuously updated knowledge base. The full open-source vector database benchmark covers filtered search, build cost, and live churn, and the vector database sizing calculator turns those limits into a per-engine verdict for a specific server.

How to choose your RAG stack

The benchmarks above answer “Which component is best in isolation?” This section answers “How do I assemble them?” Walk the pipeline in order and pick each stage by use case, scale, and budget:

Chunking: split documents into ~300–500 token passages with 10–20% overlap; prefer semantic/structure-aware splitting over fixed sizes for heterogeneous documents.
Embedding model: voyage-3.5 for best quality-per-dollar on an API; qwen3-embedding-8b or NVIDIA Nemotron-8B if you must self-host; pick a multilingual or multimodal model if your corpus needs it.
Vector database: Redis when single-query latency dominates, Weaviate or Milvus for sustained concurrency, Milvus when memory is the constraint at scale, pgvector when the stack is already on Postgres; four of the seven (Qdrant, Milvus, Weaviate, LanceDB) fuse hybrid results natively. Size the index against the server first, since a 16 GB box holds about 1.5M vectors on Redis and 3.7M on Qdrant at 1024 dimensions.
Hybrid retrieval: combine dense + BM25 with RRF, which lifted nDCG@10 by 0.030 to 0.067 across the engines that carry a keyword arm in our vector database benchmark; the lift clears zero at 95% confidence for Qdrant, LanceDB, Redis, and Milvus, and does not for pgvector or Weaviate.
Reranking: add a cross-encoder (a 149M model is enough) to recover the ~20 points of top-1 accuracy a bi-encoder leaves on the table.
Generation: use a model with grounded-citation support, so answers are source-attributable.
Evaluation: wire in retrieval, generation, and end-to-end metrics before you ship.
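The chunking step at the top of that list can be sketched as a sliding window with overlap. This is the fixed-size baseline only, not semantic or structure-aware splitting, and it assumes the text has already been tokenized; chunk_tokens and its parameters are hypothetical names chosen for illustration.

```python
def chunk_tokens(tokens, chunk_size=400, overlap_ratio=0.15):
    """Split a token list into fixed-size windows that overlap by overlap_ratio."""
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = round(chunk_size * overlap_ratio)
    stride = max(1, chunk_size - overlap)  # tokens the window advances each step
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

# A stand-in document of 1,000 tokens (integers instead of real token ids).
document = list(range(1000))
chunks = chunk_tokens(document, chunk_size=400, overlap_ratio=0.15)
print(len(chunks), [len(c) for c in chunks])
```

Each chunk repeats the last 60 tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.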

Enterprise governance

For enterprise deployments, retrieval quality is necessary but not sufficient; the retrieval layer also has to be governed. Production RAG is expected to enforce permission-aware retrieval (results respect the source system’s access controls, so a user never retrieves a document they could not open directly), sync with identity providers (Okta, Azure AD, Auth0) so permission changes propagate in near real time, log every retrieval for audit, run input/output guardrails, and honor data-residency constraints. Treat these as table stakes, not add-ons, for any RAG system touching internal data. ² Those controls have to hold in the retrieval layer, not only in the application above it, and the open-source engines differ on what they can enforce. Of the seven we benchmarked, pgvector alone offers point-in-time recovery and row-level security; Qdrant, Milvus, and Weaviate ship replication and RBAC in their open-source builds; Chroma 1.x ships no authentication at all; and none of the seven encrypts data at rest natively, which leaves that to disk or volume encryption.
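As an illustration of permission-aware retrieval, the sketch below filters candidates by group membership before truncating to the top results, so restricted chunks never occupy a slot in the context. The Chunk class, the group names, and permitted_results are hypothetical; a production system would usually push the same filter into the database query (for example as a metadata filter or row-level security policy) and not apply it in application code.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    score: float
    allowed_groups: frozenset = field(default_factory=frozenset)

def permitted_results(candidates, user_groups, top_k=3):
    """Drop chunks the user may not read, then return the best-scoring survivors."""
    visible = [c for c in candidates if c.allowed_groups & user_groups]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:top_k]

candidates = [
    Chunk("Q3 board minutes", 0.91, frozenset({"executives"})),
    Chunk("Expense policy", 0.84, frozenset({"all-staff"})),
    Chunk("Travel guidelines", 0.80, frozenset({"all-staff", "contractors"})),
]

results = permitted_results(candidates, user_groups={"all-staff"}, top_k=2)
print([c.text for c in results])
```

Filtering before the top_k cut matters: filtering afterwards could return fewer results than requested, or none, even when permitted documents exist further down the ranking.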

RAG vs. long context

With context windows reaching millions of tokens, a fair question is whether RAG is still necessary. In 2026, the answer is not either/or: RAG retrieves the relevant evidence, a long context window can refine over it, and a routing layer decides which path each query takes.

The decision usually comes down to cost. Because an LLM bills for every input token on every request, stuffing a full corpus into context is expensive at scale. For large knowledge bases under steady query load, RAG can run on the order of 1,250× cheaper per query than long-context stuffing, since it pays for a few thousand retrieved tokens instead of the whole archive each time. ³

That advantage is conditional, and worth stating honestly: RAG wins on cost above roughly 500K tokens of corpus and a few thousand queries a day, while below ~200K tokens and a few hundred queries a day, long context with prompt caching often wins outright, because the vector database’s fixed hosting cost alone can exceed the entire long-context bill. ⁴ Our sizing model puts that floor in concrete terms. A 2 GB corpus at 512-token chunks becomes about 1.15M vectors, which needs 5.1 GB of RAM on Qdrant or 6.9 GB on Milvus, a server that costs the same whether or not a query arrives. Accuracy still favors retrieval for needle-in-haystack lookups, where filtering out irrelevant text reduces the “lost in the middle” attention drift that degrades long-context recall.
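The two regimes can be made concrete with a back-of-envelope cost model. Every number below is hypothetical: the prices, corpus sizes, hosting cost, and cache discount are chosen only so the large-corpus ratio lands near the figure quoted above and the small-corpus case flips. They are not the inputs behind the cited estimates, and the model ignores output tokens and embedding costs.

```python
def daily_cost(queries_per_day, prompt_tokens, price_per_million, fixed_cost_per_day=0.0):
    """Daily bill: input tokens across all queries plus any fixed hosting cost."""
    token_cost = queries_per_day * prompt_tokens / 1_000_000 * price_per_million
    return token_cost + fixed_cost_per_day

# All figures below are hypothetical and only show the shape of the trade-off.
PRICE = 3.00          # dollars per 1M uncached input tokens
CACHED_PRICE = 0.30   # dollars per 1M cached input tokens
RETRIEVED = 4_000     # tokens of retrieved context per RAG query
HOSTING = 10.00       # dollars per day for the vector database server

# Large corpus: per-query input cost of stuffing 5M tokens vs retrieving 4K.
per_query_ratio = daily_cost(1, 5_000_000, PRICE) / daily_cost(1, RETRIEVED, PRICE)

# Small corpus, light load: 100K tokens, 200 queries a day, with prompt caching.
small_long_context = daily_cost(200, 100_000, CACHED_PRICE)
small_rag = daily_cost(200, RETRIEVED, PRICE, fixed_cost_per_day=HOSTING)

print(f"large corpus: long context costs about {per_query_ratio:.0f}x more per query")
print(f"small corpus: long context ${small_long_context:.2f}/day vs RAG ${small_rag:.2f}/day")
```

Under these assumptions the fixed hosting line dominates the small-corpus RAG bill, which is why the break-even depends on both corpus size and query volume.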

What are the available RAG models and tools?

RAG tooling falls into three groups: LLMs and APIs with built-in grounding, orchestration frameworks, and the underlying retrieval components (embedding models, vector databases, rerankers).

LLMs and APIs with built-in grounding

Several model providers now ship grounded-generation features so you can attach external knowledge with source attribution:

Anthropic Claude: a Citations API that grounds answers in the documents you supply and returns references to the exact passages used. ⁵
Google Gemini: a built-in File Search tool that handles RAG for you (upload documents and Gemini chunks, embeds, and retrieves them at query time), plus Vertex AI RAG Engine for managed enterprise retrieval. Its separate “grounding with Google Search” feature pulls from the live web, not your own data. ⁶
Cohere Command: RAG-tuned models (Command R/R+ and the newer Command A) that return inline citations out of the box, paired with a dedicated Rerank endpoint. ⁷
OpenAI: a file-search retrieval tool in the Assistants and Responses APIs. ⁸

RAG libraries and frameworks

These wire retrieval and generation into a pipeline:

LangChain / LangGraph: general-purpose orchestration; LangGraph adds stateful, agentic retrieve-reflect-verify loops.
LlamaIndex: data ingestion, indexing, and query engines.
Haystack: end-to-end pipelines for search and question answering.
DSPy: declarative, optimizer-driven prompt/retrieval programs.

For a deeper comparison, see our RAG frameworks analysis.

What is retrieval-augmented generation?

Retrieval-augmented generation is a technique that gives a large language model access to an external knowledge source at query time. Instead of answering only from parameters fixed during training, the model retrieves relevant passages from a document store and conditions its response on them. This keeps answers current, grounds them in citable sources, and reduces hallucination on knowledge-intensive tasks, without retraining the model.

How do RAG models work?

At its core, RAG runs in two phases: retrieval (find the passages relevant to the query) and generation (write an answer conditioned on those passages). In production systems, that core loop is wrapped in a fuller pipeline:

Query rewriting/decomposition: rephrase or split the question to retrieve better, especially for multi-turn or multi-hop queries.
Hybrid retrieval: run dense (vector) and sparse (BM25) searches and fuse the results with RRF.
Reranking: a cross-encoder re-scores the candidates and keeps the top few.
Context assembly: build the prompt from the selected chunks with citations.
Generation: the LLM answers from the assembled context.
Evaluation: score retrieval and answer quality, ideally in CI.

The two-phase loop is still the mental model; the extra stages are what separate a demo from a production system.
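The hybrid-retrieval fusion step can be sketched with reciprocal rank fusion (RRF), which needs only the rank positions from each retriever, not their raw scores. The constant k=60 is the value commonly used in the RRF literature; the document ids and the two input rankings are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids; each list adds 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

dense = ["d3", "d1", "d7"]    # hypothetical vector-search ranking
sparse = ["d1", "d9", "d3"]   # hypothetical BM25 ranking

fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
```

Documents that both retrievers rank highly rise to the top, while a document seen by only one retriever still stays in the candidate set for the reranker.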

What are the different types of RAG?

Beyond the linear pipeline, several RAG variants target specific failure modes: Speculative RAG (draft-and-verify for speed), Retrieval-Augmented Fine-Tuning (RAFT) (train the model to use retrieved context), and Self-RAG and Corrective RAG (CRAG) (the model critiques and re-retrieves when evidence is weak). These overlap with the advanced architectures below.

Advanced RAG architectures

Graph-based RAG (GraphRAG)

GraphRAG builds a knowledge graph over the corpus, often on a dedicated graph database such as Neo4j or FalkorDB, so the system can answer multi-hop and global-aggregation questions that flat vector search misses. Its edge on those questions comes largely from pre-computing relationships across the whole corpus rather than from better passage retrieval, so vector search still tends to win on specific-document lookups. The practical takeaway: reach for a graph when queries require global reasoning across many documents, not as a drop-in replacement for vector retrieval.
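To show what a multi-hop lookup means in practice, here is a minimal sketch that walks a tiny in-memory graph. It assumes entities and relations have already been extracted; the company names, relations, and the multi_hop_paths helper are hypothetical and do not reflect any particular graph database's API.

```python
from collections import deque

def multi_hop_paths(graph, start, max_hops=2):
    """Breadth-first walk returning every relation path of up to max_hops edges."""
    paths = []
    queue = deque([(start, [])])
    while queue:
        node, path = queue.popleft()
        if path:
            paths.append(path)
        if len(path) < max_hops:
            for relation, neighbor in graph.get(node, []):
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return paths

# Hypothetical facts that live in two different documents.
graph = {
    "Acme": [("acquired", "Beta Labs")],
    "Beta Labs": [("supplies", "Gamma Corp")],
}

for path in multi_hop_paths(graph, "Acme", max_hops=2):
    print(" -> ".join(f"{s} {r} {o}" for s, r, o in path))
```

The two-edge path links Acme to Gamma Corp even though no single passage mentions both, which is the kind of connection a flat top-k vector search over passages can miss.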

Agentic RAG

Agentic RAG puts an LLM agent in charge of retrieval: deciding what to fetch, which source or tool to call, and when to reflect and retry, looping until the answer is grounded. In our agentic RAG benchmark, which tests an agent that must route each question to the right database and then write SQL against it, the strongest models now route almost perfectly (Claude Opus 4.8 at 100%, Fable 5 at 98%), while writing correct SQL against the chosen schema stays the harder ceiling, topping out around 90%. Routing is close to solved; grounded execution is where agentic RAG still differentiates.

Hybrid, iterative, and active RAG

Hybrid retrieval (dense + sparse, covered above) is now the default rather than an advanced option. Iterative and active variants (e.g., FLARE) let the model retrieve repeatedly as it generates, fetching new evidence when its confidence drops.

How to evaluate RAG systems

RAG evaluation is now structured across the lifecycle in three layers: retrieval (precision, recall, MRR, nDCG, hit@k: did we fetch the right chunks?), generation (groundedness, faithfulness: is the answer supported by the retrieved context?), and end-to-end (is the final answer correct?).

The tooling divides along the same lines: RAGAS for fast, reference-free iteration during development; DeepEval as a pytest-style pass/fail gate in CI so a regression blocks the build; and TruLens or Phoenix for tracing and monitoring in production. TREC-RAG and ARES are useful external references for judge calibration. ⁹

Retrieval metrics split in two once a vector database is in the loop, and the halves can move in opposite directions. ANN recall asks whether the index returned the true nearest vectors, which isolates the database; nDCG and MRR against human labels ask whether those documents are relevant, which is mostly a property of the embedding model. Scaling a corpus from 50k to 2.25M vectors in our vector database benchmark dropped nDCG@10 from about 0.81 to about 0.57 while every engine still reported Recall@10 above 0.973, and the exact-kNN oracle fell to the same 0.572. A geometric-only benchmark would have reported a healthy index over a corpus that had lost nearly a third of its answer quality.
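The difference between the two halves can be sketched in a few lines: ANN recall compares the index's result with a brute-force nearest-neighbour search, while relevance is judged against human labels. The vectors, ids, and labels below are hypothetical toy data, and the cosine helper assumes non-zero vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def exact_top_k(query, vectors, k):
    """Brute-force nearest neighbours by cosine similarity (the exact-kNN oracle)."""
    ranked = sorted(vectors, key=lambda doc_id: cosine(query, vectors[doc_id]), reverse=True)
    return ranked[:k]

def ann_recall_at_k(ann_ids, exact_ids):
    """Fraction of the true nearest neighbours that the index returned."""
    return len(set(ann_ids) & set(exact_ids)) / len(exact_ids)

vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
query = [1.0, 0.0]

oracle = exact_top_k(query, vectors, k=2)   # true nearest neighbours
index_result = ["a", "b"]                   # what a hypothetical ANN index returned
human_relevant = {"d"}                      # hypothetical human relevance label

geometric = ann_recall_at_k(index_result, oracle)
relevance_hit = bool(set(index_result) & human_relevant)
print(geometric, relevance_hit)
```

Here the index is geometrically perfect yet returns nothing a human labelled relevant, which is the failure a recall-only benchmark cannot see.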

Metric category	Key metric	What it measures
Retrieval	MRR / nDCG	Is the most relevant chunk ranked first?
Retrieval	Recall@k	Did we miss any crucial document?
Retrieval	ANN Recall@10	Did the index return the true nearest vectors?
Generation	Faithfulness	Is the answer free of hallucination?
Safety	Negative rejection	Does the model refuse when evidence is absent?
Framework	RAGAS / DeepEval / TruLens	Dev iteration / CI gate / production tracing
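Two of the retrieval metrics in the table, MRR and Recall@k, can be sketched without any framework. The function names and the sample runs are hypothetical; evaluation libraries compute the same quantities with more bookkeeping.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)

def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of the relevant documents that appear in the top k results."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

# Hypothetical results for two queries: (retrieved order, labelled relevant set).
runs = [
    (["a", "b", "c"], {"b"}),
    (["x", "y", "z"], {"x", "z"}),
]

print(mean_reciprocal_rank(runs))
print(recall_at_k(["x", "y", "z"], {"x", "z"}, k=2))
```

A check such as asserting that MRR stays above an agreed threshold on a fixed query set is one simple way to turn these functions into a CI gate.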

Chunk size

Chunk size controls how documents are split before embedding.

The 2026 guidance has moved beyond a single fixed size: prefer semantic / structure-aware chunking (start a new chunk where adjacent sentences diverge in meaning), keep chunks around 300–500 tokens with 10–20% overlap, and consider contextual retrieval: Anthropic’s technique of prepending an LLM-generated context sentence to each chunk before embedding and BM25 indexing. In Anthropic’s tests, contextual embeddings cut the top-20 retrieval-failure rate by 35%, contextual embeddings plus contextual BM25 by 49%, and adding a reranker on top by 67%. ¹⁰ Chunk size also sets how large the index gets, since it decides how many vectors the corpus becomes. Our vector database sizing calculator makes the link explicit: at its default 512-token chunk and 15% overlap the corpus advances 435 tokens per chunk, so halving the chunk roughly doubles both the vector count and the memory the database has to hold.
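The link between chunk size and index size is simple arithmetic, sketched below. The 512-token chunk and 15% overlap are the defaults quoted above; the 500M-token corpus is a hypothetical input, and the helper names are illustrative and are not taken from the sizing calculator.

```python
import math

def chunk_stride(chunk_tokens, overlap_ratio):
    """Tokens the window advances per chunk: chunk size minus the overlapped part."""
    return round(chunk_tokens * (1 - overlap_ratio))

def vector_count(corpus_tokens, chunk_tokens=512, overlap_ratio=0.15):
    """Approximate number of chunks (and so vectors) a corpus produces."""
    return math.ceil(corpus_tokens / chunk_stride(chunk_tokens, overlap_ratio))

CORPUS_TOKENS = 500_000_000  # hypothetical corpus size

full = vector_count(CORPUS_TOKENS, chunk_tokens=512)
half = vector_count(CORPUS_TOKENS, chunk_tokens=256)

print(chunk_stride(512, 0.15))            # tokens advanced per chunk
print(f"{full:,} vs {half:,} vectors")    # halving the chunk roughly doubles the count
```

Because memory grows with the number of vectors, the same ratio carries through to the RAM the database needs.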

Fine-tuning vs. retrieval-augmented generation

RAG and fine-tuning solve different problems, and in 2026 they are increasingly used together rather than as alternatives.

Category	RAG	Fine-Tuning
Knowledge access	Retrieves external, updatable information at query time	Bakes knowledge into weights; static until retrained
Up-to-date data	Incorporates the latest data without retraining	Requires retraining to update
Transparency	Answers cite retrieved sources	The decision process is opaque
Best for	Knowledge-intensive, fast-changing domains	Fixed style, format, or task behavior
Combine both	NA	RAFT trains the model to use retrieved context, getting fine-tuning’s behavior with RAG’s freshness

For most teams, the answer is “RAG first, fine-tune the behavior if needed,” and RAFT formalizes doing both.

Benefits of retrieval-augmented generation

RAG’s advantages cluster into a few that actually drive adoption: accuracy and freshness (answers reflect current, source-grounded data, not a frozen training cutoff), transparency (responses cite the passages they used, so they are auditable), lower cost than long context at scale, and adaptability (update the knowledge base instead of retraining the model). Multimodal RAG extends these to images, PDFs, and tables.

Cite this benchmark

Pick the format that matches where you're publishing. Pasting the link version into your CMS preserves the backlink.

Ekrem Sarı (2026) - "Best RAG Tools, Frameworks, and Libraries". Published online at AIMultiple.com. Retrieved July 18, 2026, from: https://aimultiple.com/retrieval-augmented-generation [Online Resource]

Sarı, E. (2026, July 18). Best RAG Tools, Frameworks, and Libraries. AIMultiple. https://aimultiple.com/retrieval-augmented-generation

@misc{sari2026,
  author = {Sarı, Ekrem},
  title  = {{Best RAG Tools, Frameworks, and Libraries}},
  year   = {2026},
  month  = jul,
  howpublished    = {\url{https://aimultiple.com/retrieval-augmented-generation}},
  note   = {AIMultiple. Retrieved July 18, 2026}
}