Berk Kalelioğlu
Research interests
Berk focuses on machine learning, agentic AI tools, and large and small language models (LLMs and SLMs). He is part of the AIMultiple benchmark team, conducting assessments and providing insights to help readers understand emerging technologies and their real-world applications.
Professional experience
He began his career as a Tech Project Lead at ODTU IVME-R, where he led a project to build physical quantum and pseudorandom number generators. After his tenure at IVME-R, he co-founded a game development company and released a game on Steam.
He later shifted his career toward AI and joined AIMultiple as a Researcher.
Education
Berk holds a Bachelor’s degree in Mathematics from Ankara University.
Latest Articles from Berk
A-CODE-LLM Bench: Agentic Coding Benchmark
We benchmarked the top Large Language Models (LLMs) across 10 software development tasks using an agentic CLI tool. We executed ~3,500 automated validation steps per model across both API and UI layers. A-CODE-LLM Bench results: each alias ran 3 times across 10 tasks (30 samples per alias, 270 cells per iteration).
HALC-Bench: LLM Hallucination on Long-Context Retrieval Benchmark
HALC-Bench (LLM Hallucination on Long-Context Retrieval Benchmark) measures how well a large language model resists fabricating evidence for a metric that does not exist in the target document. It uses 204 questions and 3 haystacks placed at the beginning, middle, and end of the model’s context window.
Agentic CLI Benchmark: Codex Wins, Kiro Fastest
Agentic CLI tools are AI coding tools that can create and delete files, run commands, and plan and carry out the coding of an entire project.
VELC-Bench: Verification on Long Context Benchmark
VELC-Bench measures a model’s ability to locate a specific metric in context, compare its value to a claim, and confirm or reject that claim. This tests fine-grained value matching under long-context conditions: the model must both retrieve the value and perform a precise comparison.
RELC-Bench: Retrieval on Long Context Benchmark
RELC-Bench (Retrieval on Long Context Benchmark) measures a model’s ability to find and extract a specific numeric value from one or more documents within its context. It tests whether the model can remember and retrieve a specific fact it just saw in the input.
Tabular Models Benchmark: Performance Across 19 Datasets 2026
We benchmarked 7 widely used tabular learning models to identify top-performing model families across 19 real-world datasets of varying sizes and structures, covering ~260,000 samples and over 250 total features, with dataset sizes ranging from 435 to nearly 49,000 rows. Tabular learning models benchmark results: in the chart, the winning model receives 1 point.
VPS Benchmark: Hetzner vs Digital Ocean
We benchmarked 6 Virtual Private Server (VPS) providers by running ~1,200 automated tests per server across CPU, memory, disk I/O, and network speed using sysbench, fio, and speedtest-cli. We also documented the full signup-to-SSH experience for each provider.
RL Environments: The Infrastructure Behind Agentic AI
Reinforcement learning environments are controlled environments where AI agents take actions, observe outcomes, and receive feedback. They are becoming more useful as models move from one-shot answers to multi-step work in coding, browser tasks, customer support, and business software. RL environment companies: some companies sell custom environments for coding, finance, enterprise workflows, or computer-use tasks.
OpenClaw (Moltbot/Clawdbot) Use Cases and Security 2026
OpenClaw (formerly Moltbot and Clawdbot) is an open-source, self-hosted AI assistant designed to execute local computing tasks and interface with users through standard messaging platforms. Unlike traditional chatbots that function as advisors generating text, OpenClaw operates as an autonomous agent that can execute shell commands, manage files, and automate browser operations on the host machine.
Moltbook: Agent Driven Social Media [2026]
The rapid growth of OpenClaw has triggered an unusual social experiment: Moltbook, a Reddit-like social platform where agents interact with each other. It launched on January 28, 2026, quickly drew attention, and reached 1.5M+ agents in its first week.