Vector RAG retrieves documents by semantic similarity. Graph RAG adds a knowledge graph on top of it: an LLM extracts entities and relationships from your documents, stores them in a graph database, and graph traversal runs alongside vector search at query time.
We benchmarked whether this extra layer improves retrieval and answer accuracy on 3,904 Amazon electronics reviews with 905 queries.
Retrieval accuracy results by query type
Example questions:
- Specific search: “Find me a review about battery issues on this Bluetooth headset.”
- Entity aggregation: “What are the most common complaints about Sony products?”
- Cross-document reasoning: “What is the most common complaint across all electronics?”
Vector RAG finds specific documents better (54% vs 35%). Graph RAG retrieves relevant results for aggregation queries 3x more often (23% vs 8%) and for cross-document reasoning 4x more often (33% vs 8%).
The difference comes from how each pipeline handles the query: “Which brands have the most charging complaints?”
- Vector RAG encodes the question as an embedding and finds the 10 most similar reviews. The results are semantically related to “charging” but come from random brands.
- Graph RAG extracts “charging” and “complaint” from the question, traverses Brand → Product → Review → Negative Feature in the knowledge graph, and returns reviews grouped by brand. One query, <1ms.
What is Graph RAG?
Graph RAG adds a knowledge graph on top of vector search. Vector search still runs. The graph adds structured computation that vector search cannot do.
At index time, each document goes through two paths. An LLM extracts entities (brands, features, sentiment) and stores them as nodes and edges in a graph database. The same documents are embedded and stored in a vector index.
At query time, the system retrieves from both paths: embedding similarity and graph traversal. The results merge into a single ranked list. For aggregation queries, the graph also pre-computes counts and groupings from the full dataset. This pre-computed aggregation is what separates Graph RAG from Vector RAG.
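The two index-time paths can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: `extract_entities` and `embed` are stubs standing in for the real LLM and embedding model calls.

```python
def extract_entities(text):
    # Stand-in for the LLM extraction call; returns graph nodes and edges.
    return {
        "nodes": [("Brand", "sony"), ("Product", "wh-1000xm4")],
        "edges": [("sony", "HAS_PRODUCT", "wh-1000xm4")],
    }

def embed(text):
    # Stand-in for multilingual-e5-base; returns a 768-dim dense vector.
    return [0.0] * 768

graph_store = {"nodes": set(), "edges": set()}
vector_index = []  # (vector, document) pairs

def index_document(doc):
    # Path 1: entity extraction into the knowledge graph.
    extracted = extract_entities(doc)
    graph_store["nodes"].update(extracted["nodes"])
    graph_store["edges"].update(extracted["edges"])
    # Path 2: dense embedding into the vector index.
    vector_index.append((embed(doc), doc))

index_document("The Sony WH-1000XM4 has amazing noise cancellation...")
```

At query time, both stores are consulted; the graph additionally serves pre-computed aggregations.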
How the knowledge graph is built
An LLM reads each review and extracts entities and relationships. For example, from a single review:
“The Sony WH-1000XM4 has amazing noise cancellation, but the battery only lasts 20 hours, not 30 as advertised.”
The LLM extracts:
- Brand: Sony
- Product: WH-1000XM4
- Positive feature: noise cancellation
- Negative feature: battery life
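Assuming the LLM returns structured JSON of roughly this shape (the field names here are illustrative, not the benchmark's exact prompt contract), converting one extraction into graph edges might look like:

```python
import json

# Hypothetical JSON output for one review's extraction.
llm_output = json.loads("""
{
  "brand": "Sony",
  "product": "WH-1000XM4",
  "positive_features": ["noise cancellation"],
  "negative_features": ["battery life"]
}
""")

def to_graph(extraction, review_id):
    """Turn one extraction into (source, relation, target) edges."""
    edges = [
        (extraction["brand"], "HAS_PRODUCT", extraction["product"]),
        (extraction["product"], "HAS_REVIEW", review_id),
    ]
    for feat in extraction["positive_features"]:
        edges.append((review_id, "HAS_POSITIVE", feat))
    for feat in extraction["negative_features"]:
        edges.append((review_id, "HAS_NEGATIVE", feat))
    return edges

edges = to_graph(llm_output, "review_001")
# -> 4 edges, including ("review_001", "HAS_NEGATIVE", "battery life")
```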
3,904 reviews produce 16,120 nodes and 23,940 edges. The schema determines which questions can be answered by a single graph traversal.
Separating HAS_POSITIVE and HAS_NEGATIVE makes “top complaints for Sony” a single traversal. Without sentiment-labeled edges, the LLM would read and classify every review at query time.
If “Sony” isn’t extracted, no graph traversal finds it. If “batteries” and “battery life” don’t resolve to the same node, the counts are wrong. Different domains need different schemas. A wrong schema means the graph adds complexity without capability.
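A minimal entity-resolution sketch: map surface forms to one canonical node name so counts aggregate correctly. The alias table here is illustrative; in practice it might come from string similarity, embedding clustering, or an LLM pass.

```python
# Illustrative alias table mapping surface forms to canonical entities.
ALIASES = {
    "batteries": "battery life",
    "battery": "battery life",
    "noise cancelling": "noise cancellation",
}

def canonical(entity: str) -> str:
    """Resolve a raw extracted entity to its canonical node name."""
    key = entity.strip().lower()
    return ALIASES.get(key, key)

# "batteries" and "battery life" now resolve to the same graph node,
# so complaint counts are not split across duplicate nodes.
```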
How Graph RAG retrieves and generates answers
Example: “What are the most common complaints about Sony products?”
1. Entity extraction: Gemini Flash extracts brands: [“sony”], sentiment: negative (~$0.001, cached)
2. Vector search: e5_base cosine similarity, top 30 results (no entity extraction, pure embedding match)
3. Graph search: Cypher traversal using extracted entities, top 30 results
4. RRF merge: 1/(k + rank_vector) + 1/(k + rank_graph) with k=60, top 10 returned
5. Cypher aggregation: pre-computed counts from the full graph, passed to the LLM alongside retrieved reviews
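The RRF merge can be written directly from that formula, with k=60:

```python
def rrf_merge(vector_ranked, graph_ranked, k=60, top_n=10):
    """Reciprocal Rank Fusion: each list contributes 1/(k + rank) per doc."""
    scores = {}
    for ranked in (vector_ranked, graph_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first; documents found by both lists rise.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

merged = rrf_merge(["r1", "r2", "r3"], ["r3", "r4"])
# "r3" appears in both lists, so it ranks first despite being rank 3
# in the vector list.
```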
The Cypher aggregation in step 5 traverses Brand → Product → Review → Negative Feature for “sony”, counts each feature, and returns “compatibility: 7, durability: 4, price: 3” in <1ms. This pre-computed answer is what the LLM receives alongside the retrieved reviews.
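The traversal itself is just grouped counting. A toy in-memory version behaves like this (the edge list is hypothetical; the real pipeline runs it as a Cypher query over Neo4j):

```python
from collections import Counter

# Toy edge list following the Brand -> Product -> Review -> Negative
# Feature path from the schema above.
EDGES = [
    ("sony", "HAS_PRODUCT", "wh-1000xm4"),
    ("wh-1000xm4", "HAS_REVIEW", "r1"),
    ("wh-1000xm4", "HAS_REVIEW", "r2"),
    ("r1", "HAS_NEGATIVE", "battery life"),
    ("r2", "HAS_NEGATIVE", "battery life"),
    ("r2", "HAS_NEGATIVE", "price"),
]

def targets(source, relation):
    """All nodes reachable from `source` via `relation`."""
    return [t for s, r, t in EDGES if s == source and r == relation]

def complaint_counts(brand):
    """Count negative features reachable from a brand node."""
    counts = Counter()
    for product in targets(brand, "HAS_PRODUCT"):
        for review in targets(product, "HAS_REVIEW"):
            counts.update(targets(review, "HAS_NEGATIVE"))
    return counts

# complaint_counts("sony") -> Counter({'battery life': 2, 'price': 1})
```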
- Vector RAG encodes the question as an embedding and finds semantically similar documents. No entity extraction.
- Graph RAG additionally extracts entities from the question via LLM, feeds them into Cypher traversals, merges graph results with vector results via RRF, and computes aggregations for the LLM.
Extraction cost by dataset size
Graph RAG’s additional cost over Vector RAG is entity extraction at index time:
Graph traversal at query time is free (self-hosted, <1ms). Entity extraction from the question costs ~$0.001 per query (cacheable). New documents are added incrementally.
The accuracy gap is due to computation
We measured generation accuracy on 238 aggregation queries with and without the graph’s Cypher aggregation:
Default: Graph RAG gets Cypher aggregation computed from the full knowledge graph (3,904 reviews). Vector RAG gets raw review text.
No graph context: Both pipelines get raw review text only. No aggregation.
Without Cypher aggregation, Graph RAG drops from 73.5% to 23.1%, near Vector RAG at 18.5%. The 50pp gap was not retrieval. It was computation: the graph traverses, groups, and counts across the full dataset before the LLM generates an answer.
All generation differences significant at p < 0.001.
Graph RAG benchmark methodology
Dataset: 3,904 English electronics reviews from Amazon Reviews Multi (Kaggle), min 100 characters.
Embedding model: multilingual-e5-base (768-dim). Dense embeddings in Qdrant (in-memory).
Knowledge graph: 16,120 nodes, 23,940 edges. Entity extraction via Gemini 2.0 Flash (google/gemini-2.0-flash-001 on OpenRouter, $2.29 total). Neo4j as the graph database.
Query sets (905 total):
- Graph-structured (503): Generated from graph patterns. Tests graph traversal.
- Graph-agnostic (150): LLM-generated from review text. Tests natural language queries.
- External (252): LLM-generated independently of the graph. Five types: document lookup (65), brand aggregation (24), feature aggregation (50), brand comparison (50), global aggregation (60), plus 3 star-rating queries. Validates that graph advantages are not an artifact of graph-derived questions.
Pipelines:
Generation: Top-10 reviews passed to Gemini Flash. Graph RAG additionally passes Cypher aggregation. Fuzzy containment matching (threshold 0.80). Strict mode (0.90) preserves ordering: Graph RAG 68.9%, Vector RAG 13.5%.
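Fuzzy containment can be sketched as follows; this uses `difflib` and a sliding window, an assumption about the mechanics rather than the benchmark's exact matcher.

```python
from difflib import SequenceMatcher

def fuzzy_contains(answer: str, gold: str, threshold: float = 0.80) -> bool:
    """Check whether `gold` fuzzily appears inside `answer` by sliding a
    window of len(gold) characters over the answer and taking the best
    similarity ratio. Sketch only; the exact matcher may differ."""
    answer, gold = answer.lower(), gold.lower()
    n = len(gold)
    if n == 0:
        return True
    best = 0.0
    for i in range(max(1, len(answer) - n + 1)):
        window = answer[i:i + n]
        best = max(best, SequenceMatcher(None, window, gold).ratio())
    return best >= threshold
```

Raising the threshold to 0.90 gives the strict mode reported above.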
Statistical validation: McNemar’s test, p < 0.001, Bonferroni-corrected. Bootstrap 95% CI for Graph RAG generation accuracy: 68.1%-79.0%.
Limitation
Single domain, small dataset: 3,904 electronics reviews. Different domain, different schema, different results.
Conclusion
Graph RAG works best when the question requires computation across many documents: aggregation, counting, grouping, and comparison. For these queries, it generates correct answers 73.5% of the time vs Vector RAG’s 18.5%. Without the graph’s computation, that gap disappears (23.1% vs 18.5%).
For specific document search, Vector RAG is better (54% vs 35%). Graph RAG is not a replacement for vector search. It is a computation layer on top of it.
The two engineering decisions that determine Graph RAG performance are schema design and entity extraction quality. The schema defines which questions can be answered by a single graph traversal. Entity extraction defines which entities the graph knows about. For 3,904 documents, extraction cost $2.29 in LLM calls.
Further reading
Explore other RAG benchmarks, such as:
- Embedding Models: OpenAI vs Gemini vs Cohere
- Top 16 Open Source Embedding Models for RAG
- Top Vector Database for RAG: Qdrant vs Weaviate vs Pinecone
- Reranker Benchmark: Top 8 Models Compared
- Multimodal Embedding Models: Apple vs Meta vs OpenAI
- Hybrid RAG: Boosting RAG Accuracy
- Top 10 Multilingual Embedding Models for RAG