
HALC-Bench: LLM Hallucination on Long-Context Retrieval Benchmark

updated on Jul 7, 2026

HALC-Bench (LLM Hallucination on Long-Context Retrieval Benchmark) measures how well a large language model resists fabricating evidence for a metric that does not exist in the target document. It uses 204 questions, with haystacks placed at the beginning, middle, and end of the model’s context window.

Results

HALC-Bench: Hallucination on Long-Context Retrieval


claude-fable-5 answered all 204 traps correctly at every haystack position. Among the remaining models, gpt-5.5 hallucinated the least. We found no correlation between haystack position and hallucination rate.

Methodology

The 204 questions are prepared from Motley Fool articles dated after the models’ knowledge cutoffs, and the haystacks are positioned at relative depths of 0.1, 0.5, and 0.9 of the model’s context window.
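To make the three positions concrete, here is a minimal sketch that converts the relative depths into absolute token offsets for a given window size. The function name `haystack_offsets` is hypothetical, and the sketch assumes a depth is simply a fraction of the tested window measured from its start; the benchmark's actual placement logic is not published here.

```python
def haystack_offsets(context_tokens: int,
                     depths: tuple[float, ...] = (0.1, 0.5, 0.9)) -> dict[float, int]:
    """Map each relative depth to an absolute token offset in the window.

    Assumption: a depth of 0.1 means 10% of the way into the tested
    context window, counted in tokens from the start.
    """
    if context_tokens <= 0:
        raise ValueError("context_tokens must be positive")
    # round() guards against floating-point noise such as 0.1 * n
    return {depth: round(context_tokens * depth) for depth in depths}


if __name__ == "__main__":
    # Window sizes taken from the model list (1,000,000 and 850,000 tokens).
    for window in (1_000_000, 850_000):
        print(window, haystack_offsets(window))
```

Under this assumption, a model tested at 850,000 tokens sees its three positions at different absolute offsets than a model tested at 1,000,000 tokens, even though the relative depths are identical.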

The benchmarked models and their tested context windows, in tokens, are listed below:

anthropic/claude-fable-5: 850,000 tokens
openai/gpt-5.5: 1,000,000 tokens
google/gemini-3.1-pro-preview: 1,000,000 tokens
google/gemini-3.5-flash: 1,000,000 tokens
anthropic/claude-opus-4.8: 850,000 tokens (1,000,000 advertised)
anthropic/claude-sonnet-4.6: 1,000,000 tokens
qwen/qwen3.6-plus: 1,000,000 tokens
moonshotai/kimi-k2.6: 200,000 tokens
z-ai/glm-5.1: 200,000 tokens
minimax/minimax-m2.7: 150,000 tokens
openai/gpt-5.4-mini: 250,000 tokens

claude-opus-4.8 is tested at the 850,000-token tier because it could not successfully accept the input for the 1,000,000-token tests.

claude-fable-5 is tested through Claude Code: the model receives the 850,000-token haystack as a file and searches it with retrieval tools instead of reading it from its context window, so its scores measure the model together with the Claude Code harness.

Question format

Each question is a claim about a metric that is not discussed anywhere in the target transcript.

Example claim: The Scope 1 and 2 carbon emissions reported by DocuSign (DOCU) for Q4 2026 is 8,700 metric tons CO2e.

Expected answer: Not mentioned

Data source

Motley Fool transcripts published after the models’ knowledge cutoff dates are used as the data source. The traps are hand-authored from each transcript’s actual content gaps. For each of the 14 source transcripts:

Manually identify metric categories absent from the transcript (e.g., DocuSign Q4 2026 never discusses ESG / carbon metrics; Adobe never breaks out APAC revenue; Lennar never reports R&D expense because it’s a homebuilder).
Construct a plausible-sounding claim with a realistic number, units, and quarter reference.
Programmatically verify absence via keyword search against body text. Each claim has 3–8 keyword variants (e.g., “carbon emissions,” “scope 1,” “scope 2,” “ghg,” “co2”); if any keyword hits the cleaned body, the trap is rejected as ambiguous.
Hand-review survivors to filter false negatives from the keyword check.
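The programmatic absence check in the third step can be sketched as follows. This is a minimal illustration, not the benchmark's actual script: the helper names (`normalize`, `keyword_hits`, `accept_trap`) and the sample body text are hypothetical, and the sketch assumes plain case-insensitive substring matching on a whitespace-normalized body.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching ignores case and layout."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def keyword_hits(body: str, keywords: list[str]) -> list[str]:
    """Return every keyword variant that occurs in the cleaned body text."""
    cleaned = normalize(body)
    return [kw for kw in keywords if normalize(kw) in cleaned]


def accept_trap(body: str, keywords: list[str]) -> bool:
    """Accept a trap only if none of its keyword variants hit the body."""
    if not 3 <= len(keywords) <= 8:
        raise ValueError("expected 3-8 keyword variants per claim")
    return not keyword_hits(body, keywords)


if __name__ == "__main__":
    carbon_keywords = ["carbon emissions", "scope 1", "scope 2", "ghg", "co2"]
    # Hypothetical body snippets, not taken from any real transcript.
    silent_body = "Revenue grew 8% year over year, and billings rose."
    noisy_body = "We cut Scope 1\nemissions this year."
    print(accept_trap(silent_body, carbon_keywords))   # trap survives
    print(keyword_hits(noisy_body, carbon_keywords))   # trap is rejected
```

Substring matching errs on the side of rejection (for example, "co2" also matches "co2e"), which fits a pipeline where any hit marks the trap as ambiguous and the survivors are reviewed by hand afterwards.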

Why does this isolate hallucination?

The target document does not discuss the metric, but distractor documents in the haystack often do discuss similar metrics for other companies. A model that hallucinates will:

Either fabricate a number based on the distractors
Or claim the metric is mentioned with a wrong value (predicting no instead of not_mentioned)

Both failure modes register as score = 0. Only correctly answering “not mentioned” scores 1.0.

Scoring rule

Score = 1.0 if predicted == not_mentioned, else 0.0.

The most diagnostic error pattern is predicted = no when expected = not_mentioned. That means the model claims to have seen the metric but with a wrong value. It fabricated evidence of presence.
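The scoring rule and the diagnostic error pattern can be written down in a few lines. This is a sketch under stated assumptions: the labels `no` and `not_mentioned` come from the text above, while the function names and the idea that predictions arrive as plain strings are illustrative.

```python
EXPECTED = "not_mentioned"  # every trap in the benchmark expects this label


def score(predicted: str) -> float:
    """Score = 1.0 if predicted == not_mentioned, else 0.0."""
    return 1.0 if predicted == EXPECTED else 0.0


def is_fabricated_presence(predicted: str, expected: str = EXPECTED) -> bool:
    """Flag the most diagnostic error: the model answers 'no', which implies
    it saw the metric with a different value, when the metric is absent."""
    return expected == EXPECTED and predicted == "no"


if __name__ == "__main__":
    # Hypothetical predictions; any label other than not_mentioned scores 0.
    for predicted in ("not_mentioned", "no", "yes"):
        print(predicted, score(predicted), is_fabricated_presence(predicted))
```

Keeping the diagnostic flag separate from the score leaves the headline metric binary while still letting an analysis count how often a model asserted presence with a wrong value.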

Trap distribution by source transcript

~17 traps per transcript across 14 source transcripts spanning 10 industries (semiconductors, SaaS, retail, restaurants, CPG, homebuilding, finance, food production, enterprise hardware, and others), chosen so that the test does not measure hallucination in a single domain only.
A total of 204 distinct questions are used in the benchmark, placed at different haystack positions within the context window.
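One way to check whether position matters is to aggregate the hallucination rate per haystack position. The sketch below assumes each result is a record with hypothetical `position` and `predicted` fields; the sample records are invented for illustration and are not benchmark results.

```python
from collections import defaultdict


def hallucination_rate_by_position(records: list[dict]) -> dict[float, float]:
    """Return the share of non-'not_mentioned' answers at each position."""
    totals: dict[float, int] = defaultdict(int)
    misses: dict[float, int] = defaultdict(int)
    for record in records:
        position = record["position"]
        totals[position] += 1
        if record["predicted"] != "not_mentioned":
            misses[position] += 1
    return {pos: misses[pos] / totals[pos] for pos in sorted(totals)}


if __name__ == "__main__":
    # Toy records only; field names and values are assumptions.
    toy_records = [
        {"position": 0.1, "predicted": "not_mentioned"},
        {"position": 0.1, "predicted": "no"},
        {"position": 0.5, "predicted": "not_mentioned"},
        {"position": 0.9, "predicted": "yes"},
    ]
    print(hallucination_rate_by_position(toy_records))
```

Similar rates across the three positions would be consistent with position having no effect; the same grouping can be applied per source transcript or per industry by swapping the grouping key.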

Cite this benchmark


Cem Dilmegani (2026) - "HALC-Bench: LLM Hallucination on Long-Context Retrieval Benchmark". Published online at AIMultiple.com. Retrieved July 7, 2026, from: https://aimultiple.com/ai-hallucination [Online Resource]

Dilmegani, C. (2026, July 7). HALC-Bench: LLM Hallucination on Long-Context Retrieval Benchmark. AIMultiple. https://aimultiple.com/ai-hallucination

@misc{dilmegani2026,
  author = {Dilmegani, Cem},
  title  = {{HALC-Bench: LLM Hallucination on Long-Context Retrieval Benchmark}},
  year   = {2026},
  month  = jul,
  howpublished    = {\url{https://aimultiple.com/ai-hallucination}},
  note   = {AIMultiple. Retrieved July 7, 2026}
}

Cem Dilmegani

Principal Analyst


Cem has been the principal analyst at AIMultiple since 2017. AIMultiple informs hundreds of thousands of businesses (as per Similarweb), including 55% of the Fortune 500, every month.

Cem's work has been cited by leading global publications including Business Insider, Forbes, and the Washington Post, global firms like Deloitte and HPE, NGOs like the World Economic Forum, and supranational organizations like the European Commission.

Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.

He led technology strategy and procurement at a telco while reporting to the CEO. He also led the commercial growth of the deep tech company Hypatos, which went from zero to a seven-digit annual recurring revenue and a nine-digit valuation within 2 years. Cem's work at Hypatos was covered by leading technology publications like TechCrunch and Business Insider.

Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.


Comments 4


Abraham

Aug 25, 2025 at 11:57

This article is updated in June while the GPT 5 is announced in August. How did you test GPT 5 in AI Hallucination Rates figure

Aleyna Daldal

Sep 05, 2025 at 08:46

Hi! Thanks for your comment. We use WordPress for our articles, which allows us to update graphs and tables independently of the main text. This means that even if the article text shows an earlier update date, we can still add the latest results to the figures without altering the written sections.

Rui

Aug 08, 2025 at 20:31

Hi Cem, I've been using this article as a reference of severity of hallucination. Is it possible to refresh the report with the newly released GPT-5? Thanks!

Aleyna Daldal

Sep 05, 2025 at 08:48

Hi Rui, Thanks a lot for your interest and for using our article as a reference. We’ve already refreshed the report with GPT-5 results, so you’ll find the latest updates included in the article.

Tim

Jul 19, 2025 at 10:13

Is there any chance that you might add Claude Sonnet/Opus 4 as well as Gemini 2.5 Pro?

Aleyna Daldal

Sep 05, 2025 at 08:48

Hi Tim, Thank you for your support and suggestion. Claude Sonnet/Opus 4 and Gemini 2.5 Pro have already been added to the article, so you can now see them included in the comparisons.

Joon

Feb 28, 2025 at 16:29

Hi, thank you for interesting benchmark! I was wondering Grok3's hallucination rate, both in Think mode and without. Are you planning to add these?