Berk Kalelioğlu

AI Researcher

9 Articles

Berk is an AI researcher at AIMultiple. He has prior experience in game development and in developing pseudorandom number generators using chaotic systems.

Research interests

Berk focuses on machine learning, agentic AI tools, and large and small language models (LLMs and SLMs).

He is part of the AIMultiple benchmark team, conducting assessments and providing insights to help readers understand emerging technologies and their real-world applications.

Professional experience

He began his career as a Tech Project Lead at ODTU IVME-R, where he led a project to build physical quantum and pseudorandom number generators.

After his tenure at IVME-R, he co-founded a game development company and released a game on Steam.

He later shifted his career toward AI and joined AIMultiple as a Researcher.

Education

Berk holds a Bachelor’s degree in Mathematics from Ankara University.

Latest Articles from Berk

Open World Evaluation

Jul 23

Best Flat-Rate LLM API Providers in 2026

Flat-rate LLM providers sell unlimited model usage for a fixed monthly price instead of billing per token. This model spread because agentic coding sessions can use tens of millions of tokens, so a per-token bill is hard to predict. Very few providers offer a true flat fee; most plans marketed as flat carry a usage…

Agentic AI

Benchmark

Jul 23

A-CODE-LLM Bench: Agentic Coding Benchmark

We benchmarked the top Large Language Models (LLMs) across 10 software development tasks using an agentic CLI tool. We executed ~3,500 automated validation steps per model across both API and UI layers. Each alias ran 3 times across 10 tasks (30 samples per alias, 400 cells per iteration across 40 aliases). See more details on…

Agentic AI

Benchmark

Jul 21

AIM Agentic Marketing Benchmark

We are introducing the AIM Agentic Marketing Benchmark, which measures agent performance in competitive gap analysis and ABM target list preparation. We tested the performance of 11 models and measured end-to-end execution performance: The task scores are normalized to a 0–100 scale. The overall score is the arithmetic mean of the two task scores, giving…

Agentic AI

Benchmark

Jul 21

AI VC Benchmark: 11 AI Agents on Venture Capital Tasks

Partnering with early-stage VCs, we converted two analyst workflows into benchmarks with human-verified ground truth and scored 11 AI agents on them. See the tasks, results and the scoring method: Each of the 11 models ran each task once. Scores are out of 100. Kimi K3 produced no scorable deal-sourcing run and is recorded…

Agentic AI

Jul 16

Moltbook: Agent Driven Social Media [2026]

The rapid growth of OpenClaw has triggered an unusual social experiment: Moltbook, a Reddit-like social platform where agents interact with each other. It launched on January 28, 2026, quickly drew attention, and reached more than 1.5 million agents in its first week. For further platforms for AI agents, read Inside…

Agentic AI

Jul 16

OpenClaw (Moltbot/Clawdbot) Use Cases and Security 2026

OpenClaw (formerly Moltbot and Clawdbot) is an open-source, self-hosted AI assistant designed to execute local computing tasks and interface with users through standard messaging platforms. Unlike traditional chatbots that function as advisors generating text, OpenClaw operates as an autonomous agent that can execute shell commands, manage files, and automate browser operations on the host machine.…

Agentic AI

Benchmark

Jul 6

A-CODE-CLI Bench: Agentic CLI Benchmark

Agentic CLI tools are AI coding tools that can create and delete files, run commands, and plan and carry out the coding of an entire project. We benchmarked the leading tools across 10 real-world web development scenarios, performing ~600 atomic validation checks per agent and more than 5,000 total automated test executions, including backend logic, frontend functionality,…

Benchmark

Jul 3

Tabular Models Benchmark: Performance Across 19 Datasets 2026

We benchmarked 8 tabular learning models on 19 real-world datasets covering roughly 260,000 samples, with dataset sizes from 435 to 48,800 rows. Every model ran on the same machine with 5-fold cross-validation and identical splits. Each dataset is a round-robin of head-to-head matches between models, decided by the primary metric. Elo aggregates all 483 matches…

Enterprise Software

Benchmark

May 14

VPS Benchmark: Hetzner vs DigitalOcean

We benchmarked 6 Virtual Private Server (VPS) providers by running ~1,200 automated tests per server across CPU, memory, disk I/O, and network speed using sysbench, fio, and speedtest-cli. We also documented the full signup-to-SSH experience for each provider. We used each provider's 4 vCPU (shared) / 8 GB plan, without adding any extras or…
