
HALC-Bench: Hallucination on Long-Context Retrieval Benchmark

Şevval Alper
updated on May 26, 2026

HALC-Bench (Hallucination on Long-Context Retrieval Benchmark) measures a model’s resistance to fabricating evidence for a metric that does not exist in the target document, using three haystacks placed at the beginning, middle, and end of the model’s context window.

Results


gpt-5.5 hallucinates the least of the models in this benchmark. No correlation was found between haystack position and hallucination rate.

Methodology

204 questions were prepared from Motley Fool transcripts dated after the models’ knowledge cutoffs, and the haystacks are positioned at 0.1, 0.5, and 0.9 of the model’s context window.

The benchmarked models and their tested context windows, in tokens, are listed below:

  • openai/gpt-5.5: 1,000,000 tokens
  • google/gemini-3.1-pro-preview: 1,000,000 tokens
  • google/gemini-3.5-flash: 1,000,000 tokens
  • anthropic/claude-sonnet-4.6: 1,000,000 tokens
  • qwen/qwen3.6-plus: 1,000,000 tokens
  • moonshotai/kimi-k2.6: 200,000 tokens
  • z-ai/glm-5.1: 200,000 tokens
  • minimax/minimax-m2.7: 150,000 tokens
  • openai/gpt-5.4-mini: 250,000 tokens
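The relative positions can be turned into token offsets for each tested window. The sketch below is a minimal illustration, not the benchmark's own code: the helper name `target_offsets` is hypothetical, and it assumes that a position such as 0.1 means "10% of the way into the tested context window, measured in tokens".

```python
# Hypothetical helper; the article does not publish its haystack-building code.
# Assumption: a position of 0.1 means 10% of the tested window, in tokens.
DEPTHS = (0.1, 0.5, 0.9)

# A few of the tested context windows listed above (tokens).
TESTED_WINDOWS = {
    "openai/gpt-5.5": 1_000_000,
    "moonshotai/kimi-k2.6": 200_000,
    "minimax/minimax-m2.7": 150_000,
}


def target_offsets(window_tokens: int, depths=DEPTHS) -> dict[float, int]:
    """Map each relative depth to an absolute token offset within the window."""
    return {depth: int(round(window_tokens * depth)) for depth in depths}


if __name__ == "__main__":
    for model, window in TESTED_WINDOWS.items():
        print(model, target_offsets(window))
```

Because the offsets scale with the tested window, the same relative position lands at very different absolute depths for a 150,000-token model and a 1,000,000-token model.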

Question format

A claim about a metric that is not discussed anywhere in the target transcript:

Example claim: The Scope 1 and 2 carbon emissions reported by DocuSign (DOCU) for Q4 2026 is 8,700 metric tons CO2e.

Expected answer: Not mentioned

Data source

Motley Fool transcripts published after the models’ knowledge cutoff dates are used as the data source. The traps are hand-authored from each transcript’s actual content gaps. For each of the 14 source transcripts, the following steps are applied:

  • Manually identify metric categories absent from the transcript (e.g., DocuSign Q4 2026 never discusses ESG / carbon metrics; Adobe never breaks out APAC revenue; Lennar never reports R&D expense because it’s a homebuilder).
  • Construct a plausible-sounding claim with a realistic number, units, and quarter reference.
  • Programmatically verify absence via keyword search against body text. Each claim has 3–8 keyword variants (e.g., “carbon emissions,” “scope 1,” “scope 2,” “ghg,” “co2”); if any keyword hits the cleaned body, the trap is rejected as ambiguous.
  • Hand-review survivors to filter false negatives from the keyword check.
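The keyword-verification step above can be sketched as a simple substring check. This is a minimal illustration under stated assumptions: the function names (`normalize`, `keyword_hits`, `trap_survives`) are hypothetical, the exact cleaning applied to the body text is not described in the article, and the sample sentences are made up for the example.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so multi-word keywords match across line breaks."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def keyword_hits(body: str, keywords: list[str]) -> list[str]:
    """Return every keyword variant that appears anywhere in the cleaned body."""
    cleaned = normalize(body)
    return [kw for kw in keywords if normalize(kw) in cleaned]


def trap_survives(body: str, keywords: list[str]) -> bool:
    """A trap is kept only if no keyword variant hits; any hit rejects it as ambiguous."""
    return not keyword_hits(body, keywords)


if __name__ == "__main__":
    # Keyword variants taken from the article's carbon-emissions example.
    keywords = ["carbon emissions", "scope 1", "scope 2", "ghg", "co2"]
    # Illustrative sentences only, not real transcript text.
    print(trap_survives("Total revenue grew, driven by subscription\nrevenue.", keywords))
    print(keyword_hits("We reduced Scope 1\nemissions this year.", keywords))
```

Plain substring matching errs on the side of rejecting traps (for example, "co2" also matches "co2e"), which fits a check whose survivors are hand-reviewed afterwards.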

Why does this isolate hallucination?

The target document does not discuss the metric, but distractor documents in the haystack often do discuss similar metrics for other companies. A model that hallucinates will:

  • Either fabricate a number based on the distractors
  • Or claim the metric is mentioned with a wrong value (predicting no instead of not_mentioned)

Both failure modes score 0.0. Only the answer not_mentioned scores 1.0.

Scoring rule

Score = 1.0 if predicted == not_mentioned, else 0.0.

The most diagnostic error pattern is predicted = no when expected = not_mentioned. In that case the model claims to have seen the metric, but with a wrong value: it fabricated evidence of presence.
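The scoring rule and the diagnostic pattern can be written as two small functions. This is a sketch, not the benchmark's published code: the function names are hypothetical, and it assumes predictions arrive as plain label strings such as `no` and `not_mentioned`.

```python
EXPECTED = "not_mentioned"  # every trap in the benchmark expects this label


def score(predicted: str, expected: str = EXPECTED) -> float:
    """Scoring rule from the article: 1.0 only when the prediction is not_mentioned."""
    return 1.0 if predicted == expected else 0.0


def is_fabricated_presence(predicted: str, expected: str = EXPECTED) -> bool:
    """Flag the most diagnostic error: answering 'no' when the metric is absent."""
    return expected == EXPECTED and predicted == "no"


if __name__ == "__main__":
    for label in ("not_mentioned", "no"):
        print(label, score(label), is_fabricated_presence(label))
```

Keeping the diagnostic flag separate from the score lets the two failure modes be counted apart even though both score 0.0.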

Trap distribution by source transcript

There are ~17 traps per transcript across 14 source transcripts spanning 10 industries (semiconductors, SaaS, retail, restaurants, CPG, homebuilding, finance, food production, enterprise hardware, and others), so the test does not measure hallucination in a single domain only.
A total of 204 distinct questions are used in the benchmark, placed at different haystack positions within the context window.
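To compare haystack positions, per-question scores can be grouped by position. The sketch below is illustrative only: `rate_by_position` is a hypothetical helper, it assumes the hallucination rate is the share of questions scoring 0.0, and the sample results are invented, not benchmark data.

```python
from collections import defaultdict


def rate_by_position(results):
    """Compute the share of zero-score answers per haystack position.

    `results` is an iterable of (position, score) pairs, with score 0.0 or 1.0.
    Assumption: hallucination rate = 1 - mean score at that position.
    """
    totals = defaultdict(lambda: [0, 0.0])  # position -> [question count, summed score]
    for position, s in results:
        totals[position][0] += 1
        totals[position][1] += s
    return {pos: 1.0 - total / n for pos, (n, total) in sorted(totals.items())}


if __name__ == "__main__":
    # Invented scores for illustration; not results from the benchmark.
    sample = [(0.1, 1.0), (0.1, 0.0), (0.5, 1.0), (0.9, 0.0)]
    print(rate_by_position(sample))
```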


Şevval Alper
AI Researcher
Şevval is an AIMultiple AI researcher specializing in LLMs, AI agents and quantum technologies.
Technically reviewed by
Berk Kalelioğlu
AI Researcher
Berk is an AI Researcher at AIMultiple, focusing on agentic AI systems and language models.

Comments 4

Abraham
Aug 25, 2025 at 11:57

This article is updated in June while the GPT 5 is announced in August. How did you test GPT 5 in AI Hallucination Rates figure

Aleyna Daldal
Sep 05, 2025 at 08:46

Hi! Thanks for your comment. We use WordPress for our articles, which allows us to update graphs and tables independently of the main text. This means that even if the article text shows an earlier update date, we can still add the latest results to the figures without altering the written sections.

Rui
Aug 08, 2025 at 20:31

Hi Cem, I've been using this article as a reference of severity of hallucination. Is it possible to refresh the report with the newly released GPT-5? Thanks!

Aleyna Daldal
Sep 05, 2025 at 08:48

Hi Rui, Thanks a lot for your interest and for using our article as a reference. We’ve already refreshed the report with GPT-5 results, so you’ll find the latest updates included in the article.

Tim
Jul 19, 2025 at 10:13

Is there any chance that you might add Claude Sonnet/Opus 4 as well as Gemini 2.5 Pro?

Aleyna Daldal
Sep 05, 2025 at 08:48

Hi Tim, Thank you for your support and suggestion. Claude Sonnet/Opus 4 and Gemini 2.5 Pro have already been added to the article, so you can now see them included in the comparisons.

Joon
Feb 28, 2025 at 16:29

Hi, thank you for interesting benchmark! I was wondering Grok3's hallucination rate, both in Think mode and without. Are you planning to add these?

Cem Dilmegani
Mar 17, 2025 at 02:52

Hi Joon and thank you for your comment, Yes, we are waiting for API access.