
RAG Evaluation Tools: Weights & Biases vs Ragas vs DeepEval

Cem Dilmegani
updated on Mar 23, 2026

When a RAG pipeline retrieves the wrong context, the LLM confidently generates the wrong answer. Context relevance scorers are the primary defense.

We benchmarked five tools across 1,460 questions and 14,600+ scored contexts under identical conditions: same judge model (GPT-4o), default configurations, and no custom prompts. Under standard conditions, WandB, TruLens, and Ragas emerged as the top performers. Under adversarial pressure (entity-swapped hard negatives), WandB performed best.

RAG evaluation tools benchmark results



The top three (WandB, TruLens, Ragas) are statistically tied on Top-1 accuracy: their 95% confidence intervals overlap within the 94.0%–98.0% range.

To understand our evaluation setup and metrics in detail, see the RAG evaluation tools benchmark methodology section below.

Metrics explained

Top-1 accuracy: Can the tool assign the highest relevance score to the golden context? This measures safety against adversarial retrieval, a common failure mode in production.

NDCG@5 (normalized discounted cumulative gain): Given five contexts at different relevance levels (4, 3, 2, 1, 0), does the tool rank them in the correct order? Unlike binary accuracy, NDCG rewards tools that assign proportionally higher scores to more relevant contexts.

Spearman ρ (rank correlation): How well does a tool’s score ranking correlate with the ground truth relevance ordering? A perfect tool would produce ρ = 1.0.

MRR (mean reciprocal rank): Average of 1/rank for the golden context. If a tool ranks the golden context first, MRR = 1.0; second, MRR = 0.5; third, MRR = 0.33. Penalizes tools that bury the correct context below less relevant ones.
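Three of the metrics above can be sketched in plain Python, scoring one query's five contexts (helper names are illustrative, not from any tool's API):

```python
import math

def top1_hit(scores, golden_idx=0):
    # Argmax Top-1: does the golden context receive the highest score?
    # Ties break toward the first index, as in the benchmark convention.
    return scores.index(max(scores)) == golden_idx

def mrr_contribution(scores, golden_idx=0):
    # 1/rank of the golden context; only strictly higher scores outrank it
    rank = 1 + sum(1 for s in scores if s > scores[golden_idx])
    return 1.0 / rank

def dcg(relevances):
    # Discounted cumulative gain: relevance / log2(position + 1)
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevances))

def ndcg_at_5(scores, true_rels):
    # Order the true relevance levels by the tool's scores, then compare
    # against the ideal (perfectly sorted) ordering
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return dcg([true_rels[i] for i in order]) / dcg(sorted(true_rels, reverse=True))
```

A tool that scores the five contexts [1.0, 0.75, 0.5, 0.25, 0.0] against true levels [4, 3, 2, 1, 0] gets a Top-1 hit, an MRR contribution of 1.0, and NDCG@5 = 1.0; any inversion pulls NDCG below 1.0.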

Key findings

  1. WandB leads on identification, TruLens leads on ranking: WandB has the highest Top-1 accuracy (94.5%) but the lowest NDCG@5 (0.910) and Spearman ρ (0.669). TruLens leads on NDCG@5 (0.932), Spearman ρ (0.750), and MRR (0.594). The difference comes down to scoring design: WandB’s binary scoring is simple but coarse; TruLens’ 4-point scale has more resolution but is more prone to inversions.
  2. TruLens has the highest discrimination ratio: When distinguishing a correct context from a near-identical entity-swapped version, TruLens gets the direction right 35.5% of the time with only 8.4% inversions (4.2:1 ratio). No other tool matches this.
  3. No tool distinguishes factually wrong from factually correct contexts: All five tools score hard negatives higher than partial contexts, inverting the correct relevance order. A passage with the right entities and the wrong answer consistently outscores a passage with the right topic but no answer. This is consistent with context relevance measuring topical fit, not factual accuracy.
  4. DeepEval under-scores golden contexts: DeepEval’s statement decomposition produces competitive rankings (NDCG@5 = 0.923) but scores golden contexts at mean 0.46 vs 0.82–0.91 for other tools. This makes it unreliable for identifying the single best context.
  5. UpTrain’s ternary scale limits discrimination: Three output values (0, 0.5, 1.0) cannot represent five relevance levels. UpTrain shows the worst discrimination ratio (1.4:1) and the lowest ranking accuracy (27.6% perfect ordering).

Discrimination: golden vs. hard negative

How often does the tool assign a higher score to the golden context than to the entity-swapped hard negative?

Win = golden scores strictly higher. Tie = equal scores. Loss = hard neg scores higher.

WandB has the fewest losses (4.8%) but also the fewest wins (15.5%): its binary scoring produces ties 80% of the time. When it does differentiate, it almost always gets the direction right. WandB’s strict Top-1 accuracy (golden is the unique maximum) is only 8.3%, compared to 25.3% for TruLens; its argmax Top-1 is high because the golden context is at index 0 and benefits from tie-breaking.
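The strict-vs-argmax distinction and the win/tie/loss tally can be made concrete (a sketch with hypothetical helper names):

```python
def top1_argmax(scores, golden_idx=0):
    # First index wins ties: a binary tool that scores the golden context
    # and a hard negative both 1 still counts as a "hit" here
    return scores.index(max(scores)) == golden_idx

def top1_strict(scores, golden_idx=0):
    # Golden must be the unique maximum
    g = scores[golden_idx]
    return all(s < g for i, s in enumerate(scores) if i != golden_idx)

def win_tie_loss(golden_score, hard_neg_score):
    # Discrimination outcome for one golden/hard-negative pair
    if golden_score > hard_neg_score:
        return "win"
    if golden_score == hard_neg_score:
        return "tie"
    return "loss"
```

With binary scores [1, 1, 0, 0, 0], `top1_argmax` is True while `top1_strict` is False, which is exactly the gap between WandB's 94.5% argmax and 8.3% strict figures.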

Ranking quality

Pairwise Acc = % of all 10 context pairs per sample ranked correctly. Top-2 Acc = highest-scored context is golden or partial. 5-Way Acc = perfect monotonic ranking across all 5 levels.

WandB leads on all three metrics because its binary scoring creates a natural two-tier split (relevant vs. irrelevant) that eliminates within-tier ordering errors. Note: pairwise accuracy counts ties as correct (s[i] >= s[j]), which benefits binary tools. NDCG@5 and Spearman ρ (shown in the chart above) penalize ties and rank TruLens first.
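The tie-friendly pairwise metric can be sketched as follows (C(5,2) = 10 pairs per sample; the function name is illustrative):

```python
from itertools import combinations

def pairwise_accuracy(scores, true_rels):
    # A pair counts as correct if the more relevant context scores at
    # least as high (>=), so ties are "correct" and favor coarse scales
    pairs = list(combinations(range(len(scores)), 2))
    correct = 0
    for i, j in pairs:
        hi, lo = (i, j) if true_rels[i] > true_rels[j] else (j, i)
        if scores[hi] >= scores[lo]:
            correct += 1
    return correct / len(pairs)
```

Binary scores [1, 1, 0, 0, 0] against true levels [4, 3, 2, 1, 0] score a perfect 1.0 here, even though the within-tier ordering is unknown; NDCG@5 and Spearman ρ would penalize those same ties.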

Average scores by relevance level

No tool correctly orders Partial > Hard Negative.

How each tool evaluates context relevance

All five tools use GPT-4o as their underlying judge, but they employ different evaluation strategies.

WandB Weave: Binary LLM prompt

WandB sends a single prompt to the LLM asking it to rate relevancy “on a scale from 0 to 1.” However, its internal response schema defines the score as an integer, so the model can only return 0 or 1.

One LLM call, one binary decision. WandB answers “is this the right context?” cleanly (highest Top-1 accuracy) but cannot express degrees of relevance: a partial context and a hard negative both get the same score.

Output values: 0, 1
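The effect can be reproduced with an illustrative response schema (not WandB's actual source; the integer field type is the point):

```python
import json

# Asking the model to rate "on a scale from 0 to 1" while typing the
# field as an integer collapses the output space to {0, 1}
RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {
        "relevancy": {"type": "integer", "minimum": 0, "maximum": 1}
    },
    "required": ["relevancy"],
}

def parse_binary_relevancy(raw_json):
    # Validate and normalize the structured judge reply
    score = json.loads(raw_json)["relevancy"]
    if score not in (0, 1):
        raise ValueError(f"schema allows only 0 or 1, got {score}")
    return float(score)
```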

TruLens: 4-point Likert scale

TruLens prompts the LLM as a “RELEVANCE grader” with explicit criteria for a 0-3 scale:

  • 0: Irrelevant to the query
  • 1: Relevant to some of the query
  • 2: Relevant to most of the query
  • 3: Relevant to the entirety of the query

The raw score is normalized to 0.0–1.0 by dividing by 3. This gives TruLens four distinct output levels, providing enough granularity to distinguish partial contexts from hard negatives while keeping the prompt simple.

Output values: 0.0, 0.33, 0.67, 1.0

Ragas: Dual-judge averaging

Ragas runs two independent judge prompts on every evaluation, each with a different phrasing of the same criteria (0 = irrelevant, 1 = partially relevant, 2 = fully relevant). The final score is the average of both judges, normalized to 0.0–1.0.

Because two 3-point scales are averaged, Ragas produces five possible output values, more than any other discrete-scale tool tested. The dual-judge design also provides built-in resistance to prompt sensitivity.

Output values: 0.0, 0.25, 0.5, 0.75, 1.0
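The five-value output space falls out of the averaging arithmetic (a sketch; the judge calls are replaced by their integer verdicts):

```python
def ragas_style_score(judge_a, judge_b):
    # Each judge returns 0 (irrelevant), 1 (partially relevant), or
    # 2 (fully relevant); average the two, then normalize by the max (2)
    return (judge_a + judge_b) / 2 / 2

# Enumerate every judge combination to see the full output space
output_space = sorted({ragas_style_score(a, b) for a in range(3) for b in range(3)})
```

`output_space` comes out to [0.0, 0.25, 0.5, 0.75, 1.0]; the quarter-point values appear only when the two judges disagree.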

UpTrain: Ternary classification (A/B/C)

UpTrain frames relevance as a multiple-choice classification:

  • A (1.0): Context can answer the query completely
  • B (0.5): Context can give some relevant answer but can’t answer completely
  • C (0.0): Context doesn’t contain any information to answer the query

The ternary design can distinguish “partially relevant” from “irrelevant” but cannot separate “deceptive” from “tangentially related”; both may fall into the same bucket.

Output values: 0.0, 0.5, 1.0
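The ternary mapping, and its blind spot, fits in a few lines (names are illustrative):

```python
UPTRAIN_CHOICE_TO_SCORE = {"A": 1.0, "B": 0.5, "C": 0.0}

def uptrain_style_score(choice):
    # A deceptive entity-swapped context and a tangentially related one
    # can both be graded "B", so the scale cannot separate them
    return UPTRAIN_CHOICE_TO_SCORE[choice.strip().upper()]
```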

DeepEval: Statement decomposition (G-Eval)

Instead of asking for a single relevance score, DeepEval decomposes the context into individual statements, then asks the LLM for a verdict on each statement: “yes” (relevant) or “no” (irrelevant) to the query. The final score is the ratio of relevant statements to total statements.

The result is a continuous score (e.g., 7 out of 10 statements relevant = 0.70). However, the approach is strict: even a highly relevant context gets penalized if it contains any off-topic sentences. Golden contexts sometimes include contextual details that the decomposition marks as “irrelevant,” dragging the score below that of a shorter, more focused hard negative. This explains DeepEval’s 78.1% Top-1 accuracy.

Output values: Continuous (0.0–1.0)
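The strictness follows directly from the ratio arithmetic (a sketch; the verdict lists stand in for the LLM's per-statement judgments):

```python
def decomposition_score(verdicts):
    # verdicts: one "yes"/"no" per decomposed statement
    if not verdicts:
        return 0.0
    return sum(v == "yes" for v in verdicts) / len(verdicts)

# A golden context with two off-topic background sentences...
golden = decomposition_score(["yes", "yes", "yes", "no", "no"])
# ...loses to a short, focused hard negative with no filler
hard_negative = decomposition_score(["yes", "yes"])
```

Here `golden` scores 0.6 while `hard_negative` scores 1.0: the ratio rewards focus, not correctness.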

RAG evaluation tools benchmark methodology

Adversarial dataset design

Each query has five contexts, each at a distinct relevance level (4 down to 0): a golden context, a partial context, a hard negative, a soft negative, and an irrelevant context.

Dataset

We combine two sources:

HaluEval (480 samples): General knowledge questions spanning music, film, sports, history, geography, and more. Hard negatives, partial contexts, and soft negatives are generated by Claude.

HotPotQA (530 samples): Multi-hop reasoning questions requiring information synthesis across multiple documents.

Total: 1,010 samples, each with 5 contexts = 5,050 context evaluations per tool. All samples passed automated leak filtering (489 samples removed during generation for answer leakage).

Cross-model protocol

To eliminate self-preference bias (where an LLM evaluator prefers text generated by itself), we used Claude Sonnet 4.5 for adversarial context generation and GPT-4o as the judge for all tools. Both were called via OpenRouter with temperature=0.
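A minimal sketch of a judge call under this protocol (an OpenAI-compatible payload as accepted by OpenRouter; the model slug and prompt wording are illustrative assumptions, not any tool's actual prompt):

```python
def build_judge_request(question, context):
    # Deterministic single call: temperature=0, one fixed judge model
    return {
        "model": "openai/gpt-4o",
        "temperature": 0,
        "messages": [{
            "role": "user",
            "content": (
                "Rate the relevance of the context to the question on a "
                "0-3 scale. Reply with a single digit.\n"
                f"Question: {question}\nContext: {context}"
            ),
        }],
    }

def parse_judge_reply(text):
    # Pull the first 0-3 digit from the reply and normalize to 0.0-1.0
    for ch in text:
        if ch in "0123":
            return int(ch) / 3
    raise ValueError(f"no 0-3 score in reply: {text!r}")
```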

The adversarial traps

The multi-hop trap (Relation confusion)

Questions often require tracing a relationship chain (e.g., A is related to B, who is related to C). Hard negatives answer a simpler version of the question, breaking the chain.

Question ID 89: “Who publishes the game series that Retro City Rampage is a parody of?” Target Answer: Rockstar Games

The entity distractor trap

Retrievers often find the correct location or subject, but return metadata about the wrong event or attribute.

Question ID 90: “…The Bridge Inn is the venue for which annual competition for telling lies, held in Cumbria, England?” Target Answer: World’s Biggest Liar

The partial relevance trap

A context with the right topic and entities but no answer.

Question ID 9: “Who wrote the lyrics of Portofino with a collaborator on ‘Fiddler on the Roof’?” Target Answer: Richard Ney

TruLens and DeepEval correctly score partial contexts higher than hard negatives on these samples specifically, though this pattern does not hold across the full dataset.

Conclusion

Scoring granularity is the main tradeoff. Binary tools (WandB) win on identification because every tie defaults in their favor; multi-point tools (TruLens, Ragas) win on ranking because they can express degrees of relevance.

Context relevance works as a first-pass filter: all tools separate relevant from irrelevant contexts more than 91% of the time (pairwise accuracy). But none of them verify factual accuracy. A passage with the right entities and the wrong answer scores high across every tool tested. For factual correctness, pair with answer faithfulness metrics.

Limitations

  1. Single judge model: All evaluations use GPT-4o as the judge. Results may differ with other models.
  2. Context relevance only: This benchmark evaluates context relevance scoring only, not answer faithfulness or other RAG metrics.
  3. Default configurations: Tools were evaluated out-of-the-box. Performance may improve with custom prompt engineering.
  4. Single run with tie-breaking convention: The benchmark was executed once with temperature=0. Top-1 accuracy uses argmax (first index wins ties), which benefits tools with high tie rates (WandB: 86%). We report strict Top-1 alongside argmax where relevant.
  5. Adversarial-only dataset: All hard negatives use entity-swapping. Results reflect performance under adversarial conditions; tools may perform differently on naturally retrieved contexts.

Further reading

Explore our other RAG benchmarks.

Cem Dilmegani
Principal Analyst
Cem has been the principal analyst at AIMultiple since 2017. AIMultiple informs hundreds of thousands of businesses (per SimilarWeb), including 55% of the Fortune 500, every month.

Cem's work has been cited by leading global publications including Business Insider, Forbes, and the Washington Post; global firms like Deloitte and HPE; NGOs like the World Economic Forum; and supranational organizations like the European Commission. You can see more reputable companies and resources that have referenced AIMultiple.

Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.

He led technology strategy and procurement at a telco while reporting to the CEO. He also led commercial growth of the deep tech company Hypatos, which grew from zero to seven-digit annual recurring revenue and a nine-digit valuation within two years. Cem's work at Hypatos was covered by leading technology publications like TechCrunch and Business Insider.

Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.
Researched by
Ekrem Sarı
AI Researcher
Ekrem is an AI Researcher at AIMultiple, focusing on intelligent automation, GPUs, AI Agents, and RAG frameworks.
