Vector RAG retrieves documents by semantic similarity. Graph RAG adds a knowledge graph on top: it extracts entities and relationships from your documents, stores them in a graph database, and uses graph traversal alongside vector search at query time.
We benchmarked whether this extra layer improves answer accuracy on 3,904 Amazon electronics reviews with 905 queries across three question types.
Benchmark results by query type
Example questions:
Brand complaints / Brand praise: “What are the most common complaints about Sony?”
Global pattern: “What is the most common complaint across all products?”
Graph RAG answers brand-level questions correctly 82% of the time. Vector RAG gets 15%. The gap comes from how each pipeline handles the question:
“Which brands have the most charging complaints?”
- Vector RAG retrieves 10 reviews that mention charging, and the LLM guesses which brand has the most complaints.
- Vector RAG + metadata filters by brand and sentiment before searching, so it finds more relevant reviews. But it still retrieves 10 reviews, and the LLM still guesses.
- Graph RAG traverses all reviews that mention “charging” as a negative feature, groups by brand, and counts.
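The graph-side aggregation in that last step can be sketched with a toy in-memory edge list. The triples below are illustrative, not from the benchmark dataset:

```python
from collections import Counter

# Toy stand-in for the knowledge graph: (brand, feature, sentiment) triples.
# Values are illustrative, not from the benchmark.
review_edges = [
    ("anker", "charging", "negative"),
    ("anker", "charging", "negative"),
    ("sony", "charging", "negative"),
    ("anker", "sound quality", "positive"),
    ("sony", "battery life", "negative"),
]

def complaints_by_brand(edges, feature):
    """Count negative mentions of `feature` per brand over the full edge list."""
    return Counter(brand for brand, feat, sent in edges
                   if feat == feature and sent == "negative")

print(complaints_by_brand(review_edges, "charging").most_common())
# → [('anker', 2), ('sony', 1)]
```

The key difference from the vector pipelines: the count runs over every edge in the graph, not over a retrieved top-10.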
What is Graph RAG
Graph RAG adds a knowledge graph on top of vector search. Vector search still runs. The graph adds structured computation that vector search cannot do.
At index time, each document goes through two paths. An LLM extracts entities (brands, features, sentiment) and stores them as nodes and edges in a graph database. The same documents are embedded and stored in a vector index.
At query time, the system retrieves from both paths: embedding similarity and graph traversal. The results merge into a single ranked list. For aggregation queries, the graph also pre-computes counts and groupings from the full dataset. This pre-computed aggregation is what separates Graph RAG from Vector RAG.
Why the accuracy gap is computation, not retrieval
We isolated the graph’s contribution with three generation modes:
Default: Graph RAG gets Cypher aggregation computed from the full knowledge graph (3,904 reviews). Vector pipelines get raw review text.
Fair context: All pipelines get aggregated feature counts computed from their own retrieved reviews. Same type of structured context, but from top-10 instead of the full graph.
No context: Raw review text only.
Vector RAG + metadata jumps from 40.9% to 74.0% when given structured context.
Without structured context, graph-only retrieval drops to 39.7%, nearly identical to Vector RAG + metadata at 40.9%. Both retrieve similar documents; the graph's value lies in what it computes from them.
How the knowledge graph is built
An LLM reads each review and extracts entities and relationships. For example, from a single review:
“The Sony WH-1000XM4 has amazing noise cancellation, but the battery only lasts 20 hours, not 30 as advertised.”
The LLM extracts:
- Brand: Sony
- Product: WH-1000XM4
- Positive feature: noise cancellation
- Negative feature: battery life
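A minimal sketch of how such an extraction could be turned into nodes and edges for the graph store. The field names and edge labels here are assumptions, not the benchmark's exact schema:

```python
def to_graph(extraction):
    """Convert one LLM extraction into (node, edge) tuples for the graph store.
    Field names and edge labels are assumed, not the benchmark's exact schema."""
    nodes = [("Brand", extraction["brand"]), ("Product", extraction["product"])]
    edges = [("HAS_PRODUCT", extraction["brand"], extraction["product"])]
    for feat in extraction.get("positive_features", []):
        nodes.append(("Feature", feat))
        edges.append(("HAS_POSITIVE", extraction["product"], feat))
    for feat in extraction.get("negative_features", []):
        nodes.append(("Feature", feat))
        edges.append(("HAS_NEGATIVE", extraction["product"], feat))
    return nodes, edges

extraction = {
    "brand": "Sony",
    "product": "WH-1000XM4",
    "positive_features": ["noise cancellation"],
    "negative_features": ["battery life"],
}
nodes, edges = to_graph(extraction)
```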
The schema determines which questions are cheap:
Separating HAS_POSITIVE and HAS_NEGATIVE makes “top complaints for Sony” a single traversal. Without sentiment-labeled edges, the LLM would read and classify every review at query time.
If “Sony” isn’t extracted, no graph traversal finds it. If “batteries” and “battery life” don’t resolve to the same node, the counts are wrong. Different domains need different schemas. A wrong schema means the graph adds complexity without capability.
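Entity resolution is usually the fragile step. A minimal sketch using a hand-written alias table (a real pipeline might use embedding similarity or stemming instead; the entries are illustrative):

```python
# Hand-written alias table; entries are illustrative assumptions.
FEATURE_ALIASES = {
    "batteries": "battery life",
    "battery": "battery life",
    "noise cancelling": "noise cancellation",
}

def canonical_feature(raw: str) -> str:
    """Map a raw extracted feature string to its canonical graph node name."""
    key = raw.strip().lower()
    return FEATURE_ALIASES.get(key, key)

# "batteries" and "battery life" now resolve to the same node,
# so per-feature counts stay correct.
```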
How Graph RAG retrieves and generates answers
Example: “What are the most common complaints about Sony products?”
- Entity extraction: Gemini Flash extracts brands: [“sony”], sentiment: negative (~$0.001, cached)
- Vector search: e5_base cosine similarity, top 30 results (no entity extraction, pure embedding match)
- Graph search: Cypher traversal using extracted entities, top 30 results
- RRF merge: 1/(k + rank_vector) + 1/(k + rank_graph) with k=60, top 10 returned
- Cypher aggregation: Pre-computed counts from the full graph, passed to the LLM alongside retrieved reviews
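The RRF merge can be sketched as a standalone function. This is the standard reciprocal rank fusion formulation; the parameter names are mine:

```python
def rrf_merge(vector_ids, graph_ids, k=60, top_n=10):
    """Reciprocal Rank Fusion: each list contributes 1/(k + rank) per document,
    so documents found by both retrievers accumulate a higher score."""
    scores = {}
    for ranked in (vector_ids, graph_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# "r2" appears in both lists, so it outranks documents found by only one.
merged = rrf_merge(["r1", "r2", "r3"], ["r2", "r4"])
# → ['r2', 'r1', 'r4', 'r3']
```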
The Cypher aggregation in step 5 traverses Brand → Product → Review → Negative Feature for “sony”, counts each feature, and returns “compatibility: 7, durability: 4, price: 3” in <1ms. This pre-computed answer is what the LLM receives alongside the retrieved reviews.
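That traversal might look like the following parameterized Cypher, wrapped in a small builder. The node labels and relationship types are assumptions based on the schema described above, not the benchmark's exact query:

```python
def brand_complaints_cypher(brand: str):
    """Build a parameterized Cypher query counting negative features per brand.
    Labels and relationship types are assumed, not the benchmark's exact schema."""
    query = (
        "MATCH (b:Brand {name: $brand})-[:HAS_PRODUCT]->(:Product)"
        "<-[:REVIEWS]-(:Review)-[:HAS_NEGATIVE]->(f:Feature) "
        "RETURN f.name AS feature, count(*) AS complaints "
        "ORDER BY complaints DESC"
    )
    return query, {"brand": brand}

query, params = brand_complaints_cypher("sony")
# Execute against Neo4j with the official driver, e.g. session.run(query, params).
```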
- Vector RAG encodes the question as an embedding and finds semantically similar documents. No entity extraction.
- Graph RAG additionally extracts entities from the question via LLM, feeds them into Cypher traversals, merges graph results with vector results via RRF, and computes aggregations for the LLM.
Extraction cost by dataset size
Graph RAG’s additional cost over Vector RAG is entity extraction at index time: $2.29 for the 3,904-document benchmark corpus.
Graph traversal at query time is free (self-hosted, <1ms). Entity extraction from the question costs ~$0.001 per query (cacheable). New documents are added incrementally.
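Because many questions share entities, query-time extraction is cacheable. A runnable sketch in which a toy keyword matcher stands in for the real LLM call (Gemini Flash in the benchmark); the brand list is illustrative:

```python
from functools import lru_cache

@lru_cache(maxsize=4096)
def extract_query_entities(question: str) -> tuple:
    """Query-time entity extraction with caching: repeated questions cost nothing.
    A toy keyword matcher stands in here for the real LLM call."""
    known_brands = ("sony", "anker", "jbl")  # illustrative brand list
    return tuple(b for b in known_brands if b in question.lower())

extract_query_entities("What are common complaints about Sony?")
extract_query_entities("What are common complaints about Sony?")  # cache hit, $0
```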
Graph RAG benchmark methodology
Dataset: 3,904 English electronics reviews from Amazon Reviews Multi (Kaggle), min 100 characters.
Embedding model: multilingual-e5-base (768-dim). Dense embeddings in Qdrant (in-memory).
Knowledge graph: 16,120 nodes, 23,940 edges. Entity extraction via Gemini 2.0 Flash (google/gemini-2.0-flash-001 on OpenRouter, $2.29 total). Neo4j as the graph database.
Query sets (905 total):
- Graph-structured (503): Generated from graph patterns. Tests graph traversal.
- Graph-agnostic (150): LLM-generated from review text. Tests natural language queries.
- External (252): LLM-generated independently of the graph. Five types: document lookup (65), brand aggregation (24), feature aggregation (50), brand comparison (50), global aggregation (60), plus 3 star-rating queries. Validates that graph advantages are not an artifact of graph-derived questions.
Pipelines:
Generation: Top-10 reviews passed to Gemini Flash. Graph RAG additionally passes Cypher aggregation. Fuzzy containment matching (threshold 0.80). Strict mode (0.90) preserves ordering: Graph RAG 68.9%, Vector RAG 13.5%.
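The fuzzy containment check can be approximated with the standard library's difflib. This is an assumed reconstruction; the benchmark's exact matcher may differ:

```python
from difflib import SequenceMatcher

def fuzzy_contains(expected: str, answer: str, threshold: float = 0.80) -> bool:
    """Check whether `expected` appears in `answer`, tolerating small edits.
    Slides a window of len(expected) over the answer and keeps the best ratio.
    An assumed reconstruction of the benchmark's matcher, not its exact code."""
    expected, answer = expected.lower(), answer.lower()
    n = len(expected)
    if n == 0:
        return True
    best = 0.0
    for i in range(max(1, len(answer) - n + 1)):
        window = answer[i:i + n]
        best = max(best, SequenceMatcher(None, expected, window).ratio())
    return best >= threshold

fuzzy_contains("compatibility", "the top complaint is compatability")  # True
```

Raising the threshold to 0.90 gives the strict mode reported above.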
Statistical validation: McNemar’s test, p < 0.001, Bonferroni-corrected for all generation accuracy pairs. Bootstrap 95% CIs: Graph RAG [68.1%, 79.0%], Vector RAG + metadata [34.4%, 47.4%].
Limitations
Generation context asymmetry. Graph RAG gets Cypher aggregation from the full graph. Vector RAG gets raw text. The fair context experiment shows the gap closes with equivalent context (74.0% vs 73.9%).
Graph-structured queries favor graph approaches. 503 of 905 queries derive from graph patterns. The 252 external queries still show graph advantages on aggregation.
Single domain, small dataset. 3,904 electronics reviews. Different domain, different schema, different results.
Schema dependency. Sentiment-labeled edges enable complaint/praise aggregation. Without them, these queries wouldn’t work.
Entity extraction quality. Not formally measured. The graph can only answer questions about entities it extracted.
Conclusion
Graph RAG is a computation layer on top of vector search, not a replacement for it. Graph RAG answers aggregation questions across thousands of documents in <1ms (73.5% generation accuracy vs 18.5%). When both pipelines receive the same structured context, the accuracy gap disappears (74.0% vs 73.9%).
The two engineering decisions that determine Graph RAG performance are schema design and entity extraction quality. The schema defines which questions are cheap. Entity extraction defines which entities the graph knows about, and for this dataset cost $2.29 in LLM calls across 3,904 documents.
Further reading
Explore other RAG benchmarks, such as:
- Embedding Models: OpenAI vs Gemini vs Cohere
- Top 16 Open Source Embedding Models for RAG
- Top Vector Database for RAG: Qdrant vs Weaviate vs Pinecone
- Reranker Benchmark: Top 8 Models Compared
- Multimodal Embedding Models: Apple vs Meta vs OpenAI
- Hybrid RAG: Boosting RAG Accuracy
- Top 10 Multilingual Embedding Models for RAG