Ekrem Sarı

AI Researcher

32 Articles

Ekrem is an AI Researcher and Data Analyst at AIMultiple. He designs and runs hands-on benchmarks for AI and LLM systems.

Professional Experience

At AIMultiple, Ekrem benchmarks end-to-end AI systems and builds the data workflows and dashboards used to track benchmark and product metrics. His benchmarks cover embedding and reranker models, vector and graph databases, inference engines, quantization, GPU concurrency and multi-GPU scaling, cloud GPU pricing and providers, text-to-SQL, and RAG and agentic RAG frameworks.

Before AIMultiple, he worked as an Assessor at Yandex, where he evaluated search quality and labeled large volumes of data against detailed guidelines to support ranking and model quality.

Research Interests

Ekrem's work focuses on the MLOps and LLMOps lifecycle and on measuring the performance of AI systems. He compares models, frameworks, and infrastructure on metrics such as accuracy, throughput, API cost, and scalability, across the stack from embedding models and vector databases to GPU and cloud infrastructure. His MSc thesis automates systematic literature reviews with a RAG-based pipeline.

Education

Ekrem holds a BA from Hacettepe University and is completing an MSc at Başkent University.

Latest Articles from Ekrem

Benchmark

Jul 2

RAG Evaluation Tools: Weights & Biases vs Ragas vs DeepEval

When a RAG pipeline retrieves the wrong context, the LLM confidently generates the wrong answer. Context relevance scorers are the primary defense. We benchmarked five tools across 1,460 questions and 14,600+ scored contexts under identical conditions: same judge model (GPT-4o), default configurations, and no custom prompts. Under standard conditions, WandB, TruLens, and Ragas emerged as…

Data

Benchmark

Jul 2

Remote Browsers: Web Infra for AI Agents Compared

AI agents rely on remote browsers to automate web tasks without being blocked by anti-scraping measures. The performance of this browser infrastructure is critical to an agent’s success. We benchmarked 8 providers on success rate, speed, and features. To do this, we executed 160 automated tasks, running 4 distinct scenarios 5 times for each service…

Data

Benchmark

Jul 1

Graph Database Benchmark: Neo4j vs FalkorDB vs Memgraph

We benchmarked Neo4j, FalkorDB, and Memgraph on a synthetic graph derived from 120,000 Amazon product reviews (381K nodes, 804K edges). We ran 12 query templates with 1,000 measurements each, tested ingestion at 6 batch sizes, sustained concurrent load for 60 seconds at up to 32 threads, and measured memory, cold start, mixed workload, and index…

Benchmark

Jul 1

LLM Inference Engines: vLLM vs LMDeploy vs SGLang

We benchmarked 3 leading LLM inference engines on an NVIDIA H100: vLLM, LMDeploy, and SGLang. Each engine processed an identical workload of 1,000 ShareGPT prompts with Llama 3.1 8B-Instruct, which isolates the performance impact of each engine’s architectural choices and optimization strategies. We measured offline batch throughput across 10,000 total inference operations (1,000 prompts × 10 runs per…

Benchmark

Jun 30

Top 10 Multilingual Embedding Models for RAG

We benchmarked 10 multilingual embedding models on ~606k Amazon reviews across 6 languages (German, English, Spanish, French, Japanese, Chinese). We generated 1,800 queries (300 per language), each referencing concrete details from its source review. Models trained for search (query vs document separation) outperform larger models trained for general text similarity: e5_base (110M params) outperforms models…

Benchmark

Jun 30

Multi-GPU Benchmark: B200 vs H200 vs H100 vs MI300X

For over two decades, optimizing compute performance has been a cornerstone of my work. We benchmarked NVIDIA’s B200, H200, H100, and AMD’s MI300X to assess how well they scale for Large Language Model (LLM) inference. Using the vLLM framework with the meta-llama/Llama-3.1-8B-Instruct model, we ran tests on 1, 2, 4, and 8 GPUs. We analyzed…

Benchmark

Jun 29

Embedding Models: OpenAI vs Gemini vs Voyage

We benchmarked 15 English text-embedding models and a BM25 baseline on over 500 manually curated queries across three retrieval domains: legal contracts (CUAD), customer support (IBM TechQA), and healthcare (MedRAG PubMed). Voyage-3.5 ranks first overall. Perplexity Embed V1 0.6b reaches the upper-mid tier at the lowest price point in our benchmark. nDCG@3: Normalized discounted cumulative…

Benchmark

Jun 29

RAG Frameworks: LangChain vs LangGraph vs LlamaIndex

We benchmarked 5 RAG frameworks (LangChain, LangGraph, LlamaIndex, Haystack, and DSPy) by building the same agentic RAG workflow in each with standardized components: the same model (GPT-4.1-mini), embeddings (BGE-small), retriever (Qdrant), and tools (Tavily web search). This isolates each framework’s overhead and token efficiency. The benchmark consisted of 100 queries, with each framework running the full set…

Benchmark

Jun 29

Reranker Benchmark: Top 8 Models Compared

We benchmarked 8 reranker models on ~145k English Amazon reviews to measure how much a reranking stage improves dense retrieval. We retrieved top-100 candidates with multilingual-e5-base, reranked them with each model, and evaluated the top-10 results against 300 queries, each referencing concrete details from its source review. The best reranker lifted Hit@1 from 62.67% to…

Agentic AI

Benchmark

Jun 29

Agentic Search in 2026: Benchmark 8 Search APIs for Agents

Agentic search plays a crucial role in bridging the gap between traditional search engines and AI search capabilities. Search APIs are the first layer of an agentic tool, where performance caps the quality of everything downstream. We benchmarked 8 search APIs across 100 real-world AI/LLM queries, evaluating 4,000 retrieved results with an LLM judge that…
