AI
Explore practical insights, research, and benchmarks on artificial intelligence, including generative AI, large language models, RAG, governance frameworks, MLOps practices, and AI hardware. Gain an understanding of key tools, implementation strategies, and enterprise use cases shaping the AI landscape.
8 AI Code Models Benchmarked: LMC-Eval
More than 37% of tasks performed on AI models are about computer programming and maths.
OCR Benchmark: Text Extraction / Capture Accuracy
OCR accuracy is critical for many document processing tasks, and state-of-the-art multimodal LLMs now offer an alternative to traditional OCR.
Text-to-Video Generator Benchmark
A text-to-video generator is an AI system that turns written prompts into short videos by generating visuals, motion, and sometimes audio directly from natural language.
AI Hallucination Detection Tools: W&B Weave & Comet
We benchmarked three hallucination detection tools: Weights & Biases (W&B) Weave HallucinationFree Scorer, Arize Phoenix HallucinationEvaluator, and Comet Opik Hallucination Metric, across 100 test cases. Each tool was evaluated on accuracy, precision, recall, and latency to provide a fair comparison of their real-world performance.
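As a rough illustration of three of the metrics named above, here is a minimal sketch of how accuracy, precision, and recall can be computed for a binary hallucination detector. The function name `classification_metrics` and the sample labels are hypothetical; they are not taken from the benchmark or from any of the three tools.

```python
def classification_metrics(labels, predictions):
    """Compute accuracy, precision, and recall for binary labels.

    labels:      ground truth, True if the case contains a hallucination
    predictions: detector output, True if the tool flagged a hallucination
    """
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equal in length")

    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)          # correctly flagged
    tn = sum(1 for y, p in pairs if not y and not p)  # correctly passed
    fp = sum(1 for y, p in pairs if not y and p)      # false alarms
    fn = sum(1 for y, p in pairs if y and not p)      # missed hallucinations

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Made-up example data, for illustration only.
labels = [True, True, False, False, True]
predictions = [True, False, False, True, True]
metrics = classification_metrics(labels, predictions)
print(metrics)
```

Precision penalizes false alarms, while recall penalizes missed hallucinations, so the two are usually reported together.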
Receipt OCR Benchmark with LLMs
Extracting data from receipts is essential for businesses, as millions of employees submit their work-related expenses via receipts. With the latest developments in generative AI and large language models, data extraction accuracy has reached a level comparable to that of humans.
LLM Parameters: GPT-5 High, Medium, Low and Minimal
New LLMs, such as OpenAI’s GPT-5 family, come in different versions (e.g., GPT-5, GPT-5-mini, and GPT-5-nano) and with various parameter settings, including high, medium, low, and minimal. Below, we explore the differences between these model versions by gathering their benchmark performance and the costs to run the benchmarks.
GPU Software for AI: CUDA vs. ROCm in 2026
Raw hardware specifications tell only half the story in GPU computing. To measure real-world AI performance, we ran 52 distinct tests comparing AMD’s MI300X with NVIDIA’s H100, H200, and B200 across multi-GPU and high-concurrency scenarios.
Invoice OCR Benchmark: Extraction Accuracy of LLMs vs OCRs
Invoice processing is a critical yet labor-intensive business operation that traditionally requires manual data extraction and entry into accounting systems. This manual approach is time-consuming and susceptible to human error.
Speech-to-Text Benchmark: Deepgram vs. Whisper
We benchmarked the leading speech-to-text (STT) providers, focusing specifically on healthcare applications. Our benchmark used real-world examples to assess transcription accuracy in medical contexts, where precision is crucial. Speech-to-text benchmark results: based on both word error rate (WER) and character error rate (CER), GPT-4o-transcribe demonstrates the highest transcription accuracy among all evaluated speech-to-text systems.
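For readers unfamiliar with the two metrics, the following is a minimal sketch of how WER and CER are commonly defined: the edit distance between reference and transcript, divided by the reference length, counted in words or in characters. The helper names and the sample sentences are hypothetical, and the benchmark's own text normalization (casing, punctuation) is not specified here, so none is applied.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Made-up sentences, for illustration only.
reference = "the patient takes aspirin daily"
hypothesis = "the patient take aspirin daily"
print(f"WER: {wer(reference, hypothesis):.3f}")
print(f"CER: {cer(reference, hypothesis):.3f}")
```

A single wrong word costs far more in WER than in CER, which is one reason the two are often reported side by side.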
Bias in AI: Examples and 6 Ways to Fix it in 2026
Interest in AI is increasing as businesses witness its benefits in AI use cases. However, there are valid concerns surrounding AI technology. AI bias benchmark: to check whether the question format itself introduces bias, we tested the same questions in both open-ended and multiple-choice formats.