HALC-Bench (Hallucination on Long-Context Retrieval Benchmark) measures a model’s resistance to fabricating evidence for a metric that does not exist in the target document. It uses three haystack positions: near the beginning, middle, and end of the model’s context window.
Results
gpt-5.5 hallucinates the least of the models in this benchmark. We found no correlation between haystack position and hallucination rate.
Methodology
We prepared 204 questions from Motley Fool transcripts published after the models’ knowledge cutoff dates. The haystack positions are 0.1, 0.5, and 0.9 of the model’s context window.
The benchmarked models and their tested context windows are listed below:
- openai/gpt-5.5: 1,000,000 tokens
- google/gemini-3.1-pro-preview: 1,000,000 tokens
- google/gemini-3.5-flash: 1,000,000 tokens
- anthropic/claude-sonnet-4.6: 1,000,000 tokens
- qwen/qwen3.6-plus: 1,000,000 tokens
- moonshotai/kimi-k2.6: 200,000 tokens
- z-ai/glm-5.1: 200,000 tokens
- minimax/minimax-m2.7: 150,000 tokens
- openai/gpt-5.4-mini: 250,000 tokens
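As a rough illustration of the placement described in the methodology, the sketch below converts the three relative positions into approximate token offsets for each tested context window. The function name `position_offsets` and the assumption that a position maps to a simple token offset are hypothetical; the benchmark’s actual insertion logic is not described in this article.

```python
# Tested context windows in tokens, copied from the list above.
CONTEXT_WINDOWS = {
    "openai/gpt-5.5": 1_000_000,
    "google/gemini-3.1-pro-preview": 1_000_000,
    "google/gemini-3.5-flash": 1_000_000,
    "anthropic/claude-sonnet-4.6": 1_000_000,
    "qwen/qwen3.6-plus": 1_000_000,
    "moonshotai/kimi-k2.6": 200_000,
    "z-ai/glm-5.1": 200_000,
    "minimax/minimax-m2.7": 150_000,
    "openai/gpt-5.4-mini": 250_000,
}

# Relative haystack positions used by the benchmark.
POSITIONS = (0.1, 0.5, 0.9)


def position_offsets(window_tokens: int, positions=POSITIONS) -> dict:
    """Map each relative position to an approximate token offset.

    Assumption: a position p corresponds to token index p * window.
    """
    return {p: int(round(p * window_tokens)) for p in positions}


if __name__ == "__main__":
    for model, window in CONTEXT_WINDOWS.items():
        print(model, position_offsets(window))
```

Because the tested windows differ, the same relative position lands at very different absolute depths: 0.9 is roughly token 900,000 for a 1,000,000-token window but token 135,000 for a 150,000-token window.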
Question format
A claim about a metric that is not discussed anywhere in the target transcript:
Example claim: The Scope 1 and 2 carbon emissions reported by DocuSign (DOCU) for Q4 2026 is 8,700 metric tons CO2e.
Expected answer: Not mentioned
Data source
Motley Fool transcripts published after the models’ knowledge cutoff dates are used as the data source. The traps are hand-authored, based on each transcript’s actual content gaps. For each of the 14 source transcripts:
- Manually identify metric categories absent from the transcript (e.g., DocuSign Q4 2026 never discusses ESG / carbon metrics; Adobe never breaks out APAC revenue; Lennar never reports R&D expense because it’s a homebuilder).
- Construct a plausible-sounding claim with a realistic number, units, and quarter reference.
- Programmatically verify absence via keyword search against body text. Each claim has 3–8 keyword variants (e.g., “carbon emissions,” “scope 1,” “scope 2,” “ghg,” “co2”); if any keyword hits the cleaned body, the trap is rejected as ambiguous.
- Hand-review survivors to filter false negatives from the keyword check.
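The programmatic absence check in the steps above can be sketched as follows. This is a minimal illustration, not the benchmark’s actual code: the function names `normalize`, `keyword_hits`, and `is_valid_trap` are hypothetical, and plain case-insensitive substring matching is an assumption about how “keyword hits the cleaned body” is decided.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching ignores case and line breaks."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def keyword_hits(body: str, keywords: list) -> list:
    """Return the keyword variants that occur anywhere in the cleaned body."""
    cleaned = normalize(body)
    return [kw for kw in keywords if normalize(kw) in cleaned]


def is_valid_trap(body: str, keywords: list) -> bool:
    """A trap survives only if none of its keyword variants hit the body."""
    return not keyword_hits(body, keywords)


if __name__ == "__main__":
    # Keyword variants taken from the example in the text.
    carbon_keywords = ["carbon emissions", "scope 1", "scope 2", "ghg", "co2"]

    # Made-up snippets, used only to exercise the check.
    clean_body = "Subscription revenue grew and billings were strong this quarter."
    ambiguous_body = "We also reduced our GHG footprint\nacross all offices."

    print(is_valid_trap(clean_body, carbon_keywords))      # no variant present
    print(keyword_hits(ambiguous_body, carbon_keywords))   # variants that matched
```

Substring matching is deliberately conservative: it rejects a trap even on a partial match (for example, “co2” inside “co2e”), which errs on the side of discarding ambiguous traps. It cannot catch paraphrases that use none of the keywords, which is why the final hand-review step is still needed.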
Why does this isolate hallucination?
The target document does not discuss the metric, but distractor documents in the haystack often do discuss similar metrics for other companies. A model that hallucinates will:
- Either fabricate a number based on the distractors
- Or claim the metric is mentioned with a wrong value (predicting no instead of not_mentioned)
Both failure modes score 0.0. Only correctly answering not_mentioned scores 1.0.
Scoring rule
Score = 1.0 if predicted == not_mentioned, else 0.0.
The most diagnostic error pattern is predicted = no when expected = not_mentioned. That means the model claims to have seen the metric but with a wrong value. It fabricated evidence of presence.
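A minimal sketch of the scoring rule follows. The labels `no` and `not_mentioned` come from the text; the `yes` label, the function names `score` and `summarize`, and the definition of hallucination rate as one minus the mean score are assumptions made for illustration.

```python
EXPECTED = "not_mentioned"  # every trap in this benchmark expects this label


def score(predicted: str) -> float:
    """Return 1.0 only when the model answers not_mentioned, else 0.0."""
    return 1.0 if predicted == EXPECTED else 0.0


def summarize(predictions: list) -> dict:
    """Aggregate per-question predictions into simple summary numbers.

    Assumption: hallucination rate is 1 - mean score, and a prediction of
    "no" is counted separately as fabricated evidence of presence.
    """
    if not predictions:
        raise ValueError("predictions must not be empty")
    scores = [score(p) for p in predictions]
    mean_score = sum(scores) / len(scores)
    return {
        "mean_score": mean_score,
        "hallucination_rate": 1.0 - mean_score,
        "fabricated_presence": sum(1 for p in predictions if p == "no"),
    }


if __name__ == "__main__":
    # Hypothetical predictions for four questions.
    print(summarize(["not_mentioned", "no", "not_mentioned", "yes"]))
```

Tracking the `no` predictions separately keeps the most diagnostic error pattern visible even though every wrong answer receives the same score of 0.0.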
Trap distribution by source transcript
There are ~17 traps per transcript across 14 source transcripts. The transcripts span 10 industries (semiconductors, SaaS, retail, restaurants, CPG, homebuilding, finance, food production, enterprise hardware, and others), so the test does not measure hallucination on a single domain.
The benchmark uses a total of 204 distinct questions, distributed across the different haystack positions within the context window.