Berk Kalelioğlu

AI Researcher
10 Articles
Berk is an AI researcher at AIMultiple. He has prior experience in game development and in developing pseudorandom number generators using chaotic systems.

Research interests

Berk focuses on machine learning, agentic AI tools, and large and small language models (LLMs and SLMs).

He is part of the AIMultiple benchmark team, conducting assessments and providing insights to help readers understand emerging technologies and their real-world applications.

Professional experience

He began his career as a Tech Project Lead at ODTU IVME-R, where he led a project to build physical quantum and pseudorandom number generators.

After his tenure at IVME-R, he co-founded a game development company and released a game on Steam.

He later shifted his career toward AI and joined AIMultiple as a Researcher.

Education

Berk holds a Bachelor’s degree in Mathematics from Ankara University.

Latest Articles from Berk

Agentic AI, Jun 15

A-CODE-LLM Bench: Agentic Coding Benchmark

We benchmarked the top Large Language Models (LLMs) across 10 software development tasks using an agentic CLI tool. We executed ~3,500 automated validation steps per model across both API and UI layers. A-CODE-LLM Bench results: each alias ran 3 times across 10 tasks (30 samples per alias, 270 cells per iteration).

AI, Jun 5

HALC-Bench: LLM Hallucination on Long-Context Retrieval Benchmark

HALC-Bench (LLM Hallucination on Long-Context Retrieval Benchmark) measures a large language model’s resistance to fabricating evidence for a metric that does not exist in the target document. It uses 204 questions and 3 haystacks placed at the beginning, middle, and end of the model’s context window. Results gpt-5.

Agentic AI, Jun 3

Agentic CLI Benchmark: Codex Wins, Kiro Fastest

Agentic CLI tools are AI coding tools that can create and delete files, run commands, and plan and execute the coding of an entire project.

Agentic AI, May 26

VELC-Bench: Verification on Long Context Benchmark

VELC-Bench measures a model’s ability to locate a specific metric in context, compare its value to a claim, and confirm or reject the claim. This tests fine-grained value matching under long-context conditions: the model must both retrieve the value and perform a precise comparison.
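
The locate, compare, and confirm-or-reject steps described above can be illustrated with a minimal sketch. This is not the benchmark's own harness: the `metric: value` context format, the function names `retrieve_metric` and `verify_claim`, and the tolerance parameter are all assumptions made for illustration.

```python
import re
from typing import Optional


def retrieve_metric(context: str, metric: str) -> Optional[float]:
    """Find the first 'metric: value' pair in the context (hypothetical format)."""
    pattern = re.compile(rf"{re.escape(metric)}\s*:\s*(-?\d+(?:\.\d+)?)")
    match = pattern.search(context)
    return float(match.group(1)) if match else None


def verify_claim(context: str, metric: str, claimed: float, tol: float = 1e-9) -> str:
    """Return 'confirm', 'reject', or 'not_found' for a claimed metric value."""
    actual = retrieve_metric(context, metric)
    if actual is None:
        # The metric is absent, so there is nothing to compare against.
        return "not_found"
    return "confirm" if abs(actual - claimed) <= tol else "reject"


# Toy stand-in for a long context (assumed data, not from the benchmark).
context = "latency_ms: 41.7\nthroughput_rps: 1280\nerror_rate: 0.03"

print(verify_claim(context, "throughput_rps", 1280))  # confirm
print(verify_claim(context, "error_rate", 0.3))       # reject
```

The second call shows why the comparison has to be precise: 0.3 and 0.03 look similar but are different values, so the claim is rejected.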

Agentic AI, May 26

RELC-Bench: Retrieval on Long Context Benchmark

RELC-Bench (Retrieval on Long Context Benchmark) aims to measure a model’s ability to find and extract a specific numeric value from one or more documents within its context. It tests whether the model can remember and retrieve a specific fact it just saw in the input.

AI, May 22

Tabular Models Benchmark: Performance Across 19 Datasets 2026

We benchmarked 7 widely used tabular learning models to identify top-performing model families across 19 real-world datasets of varying sizes and structures, covering ~260,000 samples and over 250 total features, with dataset sizes ranging from 435 to nearly 49,000 rows. Tabular learning models benchmark results: in the chart, the winning model receives 1 point.

Enterprise Software, May 14

VPS Benchmark: Hetzner vs Digital Ocean

We benchmarked 6 Virtual Private Server (VPS) providers by running ~1,200 automated tests per server across CPU, memory, disk I/O, and network speed using sysbench, fio, and speedtest-cli. We also documented the full signup-to-SSH experience for each provider.

Agentic AI, Apr 24

RL Environments: The Infrastructure Behind Agentic AI

Reinforcement learning environments are controlled environments where AI agents take actions, observe outcomes, and receive feedback. They are becoming more useful as models move from one-shot answers to multi-step work in coding, browser tasks, customer support, and business software. RL environment companies: some companies sell custom environments for coding, finance, enterprise workflows, or computer-use tasks.
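
The action, observation, and feedback loop that defines an RL environment can be sketched in a few lines. The `CounterEnv` class, its `reset`/`step` interface, and the `run_episode` helper are hypothetical names chosen for illustration; they do not represent any vendor's product or a specific library's API.

```python
from typing import Callable, Tuple


class CounterEnv:
    """Toy environment: the agent must move a counter from 0 to a target."""

    def __init__(self, target: int = 3, max_steps: int = 20) -> None:
        self.target = target
        self.max_steps = max_steps
        self.state = 0
        self.steps = 0

    def reset(self) -> int:
        """Start a new episode and return the initial observation."""
        self.state = 0
        self.steps = 0
        return self.state

    def step(self, action: int) -> Tuple[int, float, bool]:
        """Apply an action (+1 or -1); return (observation, reward, done)."""
        self.state += action
        self.steps += 1
        reached = self.state == self.target
        done = reached or self.steps >= self.max_steps
        reward = 1.0 if reached else 0.0  # feedback only when the goal is hit
        return self.state, reward, done


def run_episode(env: CounterEnv, policy: Callable[[int], int]) -> float:
    """Run one episode: act, observe, collect reward, until done."""
    obs = env.reset()
    total = 0.0
    done = False
    while not done:
        action = policy(obs)
        obs, reward, done = env.step(action)
        total += reward
    return total


print(run_episode(CounterEnv(), lambda obs: 1))  # 1.0
```

A policy that always decrements never reaches the target, so the episode ends at `max_steps` with a total reward of 0.0. Environments for coding or browser tasks follow the same loop with far richer states and actions.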

Agentic AI, Apr 16

OpenClaw (Moltbot/Clawdbot) Use Cases and Security 2026

OpenClaw (formerly Moltbot and Clawdbot) is an open-source, self-hosted AI assistant designed to execute local computing tasks and interface with users through standard messaging platforms. Unlike traditional chatbots that function as advisors generating text, OpenClaw operates as an autonomous agent that can execute shell commands, manage files, and automate browser operations on the host machine.

Agentic AI, Feb 6

Moltbook: Agent Driven Social Media [2026]

The rapid growth of OpenClaw has triggered an unusual social experiment: Moltbook, a Reddit-like social platform where agents interact with each other. It launched on January 28, 2026, and quickly started to get attention. It reached 1.5m+ agents in its first week.