
RAG Evaluation Tools: Weights & Biases vs Ragas vs DeepEval

Cem Dilmegani
updated on Mar 23, 2026

When a RAG pipeline retrieves the wrong context, the LLM confidently generates the wrong answer. Context relevance scorers are the primary defense.

We benchmarked five tools across 1,460 questions and 14,600+ scored contexts under identical conditions: same judge model (GPT-4o), default configurations, and no custom prompts. Under standard conditions, WandB, TruLens, and Ragas emerged as the top performers. Under adversarial pressure (entity-swapped hard negatives), WandB performed best.

RAG evaluation tools benchmark results



The top three (WandB, TruLens, Ragas) are statistically tied on Top-1 accuracy: their 95% confidence intervals overlap within the 94.0%–98.0% range.

To understand our evaluation setup and metrics in detail, see the RAG evaluation tools benchmark methodology section below.

Metrics explained

Top-1 accuracy: Can the tool assign the highest relevance score to the golden context? This measures safety against adversarial retrieval, a common failure mode in production.

NDCG@5 (normalized discounted cumulative gain): Given five contexts at different relevance levels (4, 3, 2, 1, 0), does the tool rank them in the correct order? Unlike binary accuracy, NDCG rewards tools that assign proportionally higher scores to more relevant contexts.

Spearman ρ (rank correlation): How well does a tool’s score ranking correlate with the ground truth relevance ordering? A perfect tool would produce ρ = 1.0.

MRR (mean reciprocal rank): Average of 1/rank for the golden context. If a tool ranks the golden context first, MRR = 1.0; second, MRR = 0.5; third, MRR = 0.33. Penalizes tools that bury the correct context below less relevant ones.
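Three of the metrics above can be sketched in plain Python, scoring one query's five contexts (helper names are illustrative, not from any tool's API):

```python
import math

def top1_hit(scores, golden_idx=0):
    # Argmax Top-1: does the golden context receive the highest score?
    # Ties break toward the first index, as in the benchmark convention.
    return scores.index(max(scores)) == golden_idx

def mrr_contribution(scores, golden_idx=0):
    # 1/rank of the golden context; only strictly higher scores outrank it
    rank = 1 + sum(1 for s in scores if s > scores[golden_idx])
    return 1.0 / rank

def dcg(relevances):
    # Discounted cumulative gain: relevance / log2(position + 1)
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevances))

def ndcg_at_5(scores, true_rels):
    # Order the true relevance levels by the tool's scores, then compare
    # against the ideal (perfectly sorted) ordering
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return dcg([true_rels[i] for i in order]) / dcg(sorted(true_rels, reverse=True))
```

A tool that scores the five contexts [1.0, 0.75, 0.5, 0.25, 0.0] against true levels [4, 3, 2, 1, 0] gets a Top-1 hit, an MRR contribution of 1.0, and NDCG@5 = 1.0; any inversion pulls NDCG below 1.0.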

Key findings

  1. WandB leads on identification, TruLens leads on ranking: WandB has the highest Top-1 accuracy (94.5%) but the lowest NDCG@5 (0.910) and Spearman ρ (0.669). TruLens leads on NDCG@5 (0.932), Spearman ρ (0.750), and MRR (0.594). The difference comes down to scoring design: WandB’s binary scoring is simple but coarse; TruLens’ 4-point scale has more resolution but is more prone to inversions.
  2. TruLens has the highest discrimination ratio: When distinguishing a correct context from a near-identical entity-swapped version, TruLens gets the direction right 35.5% of the time with only 8.4% inversions (4.2:1 ratio). No other tool matches this.
  3. No tool distinguishes factually wrong from factually correct contexts: All five tools score hard negatives higher than partial contexts, inverting the correct relevance order. A passage with the right entities and the wrong answer consistently outscores a passage with the right topic but no answer. This is consistent with context relevance measuring topical fit, not factual accuracy.
  4. DeepEval under-scores golden contexts: DeepEval’s statement decomposition produces competitive rankings (NDCG@5 = 0.923) but scores golden contexts at mean 0.46 vs 0.82–0.91 for other tools. This makes it unreliable for identifying the single best context.
  5. UpTrain’s ternary scale limits discrimination: Three output values (0, 0.5, 1.0) cannot represent five relevance levels. UpTrain shows the worst discrimination ratio (1.4:1) and the lowest ranking accuracy (27.6% perfect ordering).

Discrimination: golden vs. hard negative

How often does the tool assign a higher score to the golden context than to the entity-swapped hard negative?

Win = golden scores strictly higher. Tie = equal scores. Loss = hard neg scores higher.

WandB has the fewest losses (4.8%) but also the fewest wins (15.5%): its binary scoring produces ties 80% of the time. When it does differentiate, it almost always gets the direction right. WandB’s strict Top-1 accuracy (golden is the unique maximum) is only 8.3%, compared to 25.3% for TruLens; its argmax Top-1 is high because the golden context is at index 0 and benefits from tie-breaking.
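The strict-vs-argmax distinction and the win/tie/loss tally can be made concrete (a sketch with hypothetical helper names):

```python
def top1_argmax(scores, golden_idx=0):
    # First index wins ties: a binary tool that scores the golden context
    # and a hard negative both 1 still counts as a "hit" here
    return scores.index(max(scores)) == golden_idx

def top1_strict(scores, golden_idx=0):
    # Golden must be the unique maximum
    g = scores[golden_idx]
    return all(s < g for i, s in enumerate(scores) if i != golden_idx)

def win_tie_loss(golden_score, hard_neg_score):
    # Discrimination outcome for one golden/hard-negative pair
    if golden_score > hard_neg_score:
        return "win"
    if golden_score == hard_neg_score:
        return "tie"
    return "loss"
```

With binary scores [1, 1, 0, 0, 0], `top1_argmax` is True while `top1_strict` is False, which is exactly the gap between WandB's 94.5% argmax and 8.3% strict figures.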

Ranking quality

Pairwise Acc = % of all 10 context pairs per sample ranked correctly. Top-2 Acc = highest-scored context is golden or partial. 5-Way Acc = perfect monotonic ranking across all 5 levels.

WandB leads on all three metrics because its binary scoring creates a natural two-tier split (relevant vs. irrelevant) that eliminates within-tier ordering errors. Note: pairwise accuracy counts ties as correct (s[i] >= s[j]), which benefits binary tools. NDCG@5 and Spearman ρ (shown in the chart above) penalize ties and rank TruLens first.
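The tie-friendly pairwise metric can be sketched as follows (C(5,2) = 10 pairs per sample; the function name is illustrative):

```python
from itertools import combinations

def pairwise_accuracy(scores, true_rels):
    # A pair counts as correct if the more relevant context scores at
    # least as high (>=), so ties are "correct" and favor coarse scales
    pairs = list(combinations(range(len(scores)), 2))
    correct = 0
    for i, j in pairs:
        hi, lo = (i, j) if true_rels[i] > true_rels[j] else (j, i)
        if scores[hi] >= scores[lo]:
            correct += 1
    return correct / len(pairs)
```

Binary scores [1, 1, 0, 0, 0] against true levels [4, 3, 2, 1, 0] score a perfect 1.0 here, even though the within-tier ordering is unknown; NDCG@5 and Spearman ρ would penalize those same ties.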

Average scores by relevance level

No tool correctly orders Partial > Hard Negative.

How each tool evaluates context relevance

All five tools use GPT-4o as their underlying judge, but they employ different evaluation strategies.

WandB Weave: Binary LLM prompt

WandB sends a single prompt to the LLM asking it to rate relevancy “on a scale from 0 to 1.” However, its internal response schema defines the score as an integer, so the model can only return 0 or 1.

One LLM call, one binary decision. WandB answers “is this the right context?” cleanly (highest Top-1 accuracy) but cannot express degrees of relevance: a partial context and a hard negative both get the same score.

Output values: 0, 1
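The effect can be reproduced with an illustrative response schema (not WandB's actual source; the integer field type is the point):

```python
import json

# Asking the model to rate "on a scale from 0 to 1" while typing the
# field as an integer collapses the output space to {0, 1}
RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {
        "relevancy": {"type": "integer", "minimum": 0, "maximum": 1}
    },
    "required": ["relevancy"],
}

def parse_binary_relevancy(raw_json):
    # Validate and normalize the structured judge reply
    score = json.loads(raw_json)["relevancy"]
    if score not in (0, 1):
        raise ValueError(f"schema allows only 0 or 1, got {score}")
    return float(score)
```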

TruLens: 4-point Likert scale

TruLens prompts the LLM as a “RELEVANCE grader” with explicit criteria for a 0-3 scale:

  • 0: Irrelevant to the query
  • 1: Relevant to some of the query
  • 2: Relevant to most of the query
  • 3: Relevant to the entirety of the query

The raw score is normalized to 0.0–1.0 by dividing by 3. This gives TruLens four distinct output levels, providing enough granularity to distinguish partial contexts from hard negatives while keeping the prompt simple.

Output values: 0.0, 0.33, 0.67, 1.0

Ragas: Dual-judge averaging

Ragas runs two independent judge prompts on every evaluation, each with a different phrasing of the same criteria (0 = irrelevant, 1 = partially relevant, 2 = fully relevant). The final score is the average of both judges, normalized to 0.0–1.0.

Because two 3-point scales are averaged, Ragas produces five possible output values, more than any other discrete-scale tool tested. The dual-judge design also provides built-in resistance to prompt sensitivity.

Output values: 0.0, 0.25, 0.5, 0.75, 1.0
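The five-value output space falls out of the averaging arithmetic (a sketch; the judge calls are replaced by their integer verdicts):

```python
def ragas_style_score(judge_a, judge_b):
    # Each judge returns 0 (irrelevant), 1 (partially relevant), or
    # 2 (fully relevant); average the two, then normalize by the max (2)
    return (judge_a + judge_b) / 2 / 2

# Enumerate every judge combination to see the full output space
output_space = sorted({ragas_style_score(a, b) for a in range(3) for b in range(3)})
```

`output_space` comes out to [0.0, 0.25, 0.5, 0.75, 1.0]; the quarter-point values appear only when the two judges disagree.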

UpTrain: Ternary classification (A/B/C)

UpTrain frames relevance as a multiple-choice classification:

  • A (1.0): Context can answer the query completely
  • B (0.5): Context can give some relevant answer but can’t answer completely
  • C (0.0): Context doesn’t contain any information to answer the query

The ternary design can distinguish “partially relevant” from “irrelevant” but cannot separate “deceptive” from “tangentially related”; both may fall into the same bucket.

Output values: 0.0, 0.5, 1.0
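The ternary mapping, and its blind spot, fits in a few lines (names are illustrative):

```python
UPTRAIN_CHOICE_TO_SCORE = {"A": 1.0, "B": 0.5, "C": 0.0}

def uptrain_style_score(choice):
    # A deceptive entity-swapped context and a tangentially related one
    # can both be graded "B", so the scale cannot separate them
    return UPTRAIN_CHOICE_TO_SCORE[choice.strip().upper()]
```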

DeepEval: Statement decomposition (G-Eval)

Instead of asking for a single relevance score, DeepEval decomposes the context into individual statements, then asks the LLM for a verdict on each statement: “yes” (relevant) or “no” (irrelevant) to the query. The final score is the ratio of relevant statements to total statements.

The result is a continuous score (e.g., 7 out of 10 statements relevant = 0.70). However, the approach is strict: even a highly relevant context gets penalized if it contains any off-topic sentences. Golden contexts sometimes include contextual details that the decomposition marks as “irrelevant,” dragging the score below that of a shorter, more focused hard negative. This explains DeepEval’s 78.1% Top-1 accuracy.

Output values: Continuous (0.0–1.0)
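The strictness follows directly from the ratio arithmetic (a sketch; the verdict lists stand in for the LLM's per-statement judgments):

```python
def decomposition_score(verdicts):
    # verdicts: one "yes"/"no" per decomposed statement
    if not verdicts:
        return 0.0
    return sum(v == "yes" for v in verdicts) / len(verdicts)

# A golden context with two off-topic background sentences...
golden = decomposition_score(["yes", "yes", "yes", "no", "no"])
# ...loses to a short, focused hard negative with no filler
hard_negative = decomposition_score(["yes", "yes"])
```

Here `golden` scores 0.6 while `hard_negative` scores 1.0: the ratio rewards focus, not correctness.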

RAG evaluation tools benchmark methodology

Adversarial dataset design

Each query has five contexts, each at a distinct relevance level (4 down to 0): a golden context, a partial context, a hard negative, a soft negative, and an irrelevant context.

Dataset

We combine two sources:

HaluEval (480 samples): General knowledge questions spanning music, film, sports, history, geography, and more. Hard negatives, partial contexts, and soft negatives are generated by Claude.

HotPotQA (530 samples): Multi-hop reasoning questions requiring information synthesis across multiple documents.

Total: 1,010 samples, each with 5 contexts = 5,050 context evaluations per tool. All samples passed automated leak filtering (489 samples removed during generation for answer leakage).

Cross-model protocol

To eliminate self-preference bias (where an LLM evaluator prefers text generated by itself), we used Claude Sonnet 4.5 for adversarial context generation and GPT-4o as the judge for all tools. Both were called via OpenRouter with temperature=0.
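A minimal sketch of a judge call under this protocol (an OpenAI-compatible payload as accepted by OpenRouter; the model slug and prompt wording are illustrative assumptions, not any tool's actual prompt):

```python
def build_judge_request(question, context):
    # Deterministic single call: temperature=0, one fixed judge model
    return {
        "model": "openai/gpt-4o",
        "temperature": 0,
        "messages": [{
            "role": "user",
            "content": (
                "Rate the relevance of the context to the question on a "
                "0-3 scale. Reply with a single digit.\n"
                f"Question: {question}\nContext: {context}"
            ),
        }],
    }

def parse_judge_reply(text):
    # Pull the first 0-3 digit from the reply and normalize to 0.0-1.0
    for ch in text:
        if ch in "0123":
            return int(ch) / 3
    raise ValueError(f"no 0-3 score in reply: {text!r}")
```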

The adversarial traps

The multi-hop trap (Relation confusion)

Questions often require tracing a relationship chain (e.g., A is related to B, who is related to C). Hard negatives answer a simpler version of the question, breaking the chain.

Question ID 89: “Who publishes the game series that Retro City Rampage is a parody of?” Target Answer: Rockstar Games

The entity distractor trap

Retrievers often find the correct location or subject, but return metadata about the wrong event or attribute.

Question ID 90: “…The Bridge Inn is the venue for which annual competition for telling lies, held in Cumbria, England?” Target Answer: World’s Biggest Liar

The partial relevance trap

A context with the right topic and entities but no answer.

Question ID 9: “Who wrote the lyrics of Portofino with a collaborator on ‘Fiddler on the Roof’?” Target Answer: Richard Ney

TruLens and DeepEval correctly score partial contexts higher than hard negatives on these samples specifically, though this pattern does not hold across the full dataset.

Conclusion

Scoring granularity is the main tradeoff. Binary tools (WandB) win on identification because every tie defaults in their favor; multi-point tools (TruLens, Ragas) win on ranking because they can express degrees of relevance.

Context relevance works as a first-pass filter: all tools separate relevant from irrelevant contexts more than 91% of the time (pairwise accuracy). But none of them verify factual accuracy. A passage with the right entities and the wrong answer scores high across every tool tested. For factual correctness, pair with answer faithfulness metrics.

Limitations

  1. Single judge model: All evaluations use GPT-4o as the judge. Results may differ with other models.
  2. Context relevance only: This benchmark evaluates context relevance scoring only, not answer faithfulness or other RAG metrics.
  3. Default configurations: Tools were evaluated out-of-the-box. Performance may improve with custom prompt engineering.
  4. Single run with tie-breaking convention: The benchmark was executed once with temperature=0. Top-1 accuracy uses argmax (first index wins ties), which benefits tools with high tie rates (WandB: 86%). We report strict Top-1 alongside argmax where relevant.
  5. Adversarial-only dataset: All hard negatives use entity-swapping. Results reflect performance under adversarial conditions; tools may perform differently on naturally retrieved contexts.

Further reading

Explore our other RAG benchmarks.

Cem Dilmegani
Principal Analyst
Cem has been the principal analyst at AIMultiple since 2017. AIMultiple informs hundreds of thousands of businesses (per SimilarWeb), including 55% of the Fortune 500, every month.

Cem's work has been cited by leading global publications including Business Insider, Forbes, and the Washington Post; global firms like Deloitte and HPE; NGOs like the World Economic Forum; and supranational organizations like the European Commission. You can see more reputable companies and resources that have referenced AIMultiple.

Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.

He led technology strategy and procurement at a telco while reporting to the CEO. He also led commercial growth of the deep tech company Hypatos, which grew from zero to seven-digit annual recurring revenue and a nine-digit valuation within two years. Cem's work at Hypatos was covered by leading technology publications like TechCrunch and Business Insider.

Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.
Researched by
Ekrem Sarı
AI Researcher
Ekrem is an AI Researcher at AIMultiple, focusing on intelligent automation, GPUs, AI Agents, and RAG frameworks.
