Ekrem Sarı
Ekrem is an AI Researcher at AIMultiple, focusing on intelligent automation, GPUs, AI Agents, and LLMOps for RAG frameworks.
Professional Experience
During his tenure as an Assessor at Yandex, he evaluated search results using proprietary frameworks and automated protocols. He implemented QA testing through data annotation, relevance scoring, and user intent mapping across 10,000+ queries monthly, while also conducting technical assessments, including performance monitoring and spam detection using ML feedback loops.
Research Interests
At AIMultiple, his research centers on the MLOps lifecycle and on benchmarking the end-to-end performance of AI systems. He contributes to a wide range of projects, including Retrieval-Augmented Generation (RAG) optimization, extensive Large Language Model (LLM) benchmarking, and the design of agentic AI frameworks. Ekrem specializes in developing data-driven methodologies to measure and improve AI technology performance across critical operational metrics such as accuracy, efficiency, API cost, and scalability. His analysis covers the entire technology stack, from foundational components like embedding models and vector databases to the high-performance GPU and cloud infrastructure required for deploying AI agents.
Education
Ekrem holds a bachelor's degree from Hacettepe Üniversitesi and a master's degree from Başkent Üniversitesi.
Latest Articles from Ekrem
Benchmark of 38 LLMs in Finance: Claude Opus 4.6, Gemini 3.1 Pro & More
We evaluated 38 LLMs on 238 hard questions from the FinanceReasoning benchmark (Tang et al.) to identify which models excel at complex financial reasoning tasks like statement analysis, forecasting, and ratio calculations.
Agentic Search in 2026: Benchmark 8 Search APIs for Agents
Agentic search plays a crucial role in bridging the gap between traditional search engines and AI search capabilities. These systems enable AI agents to autonomously find, retrieve, and structure relevant information, powering applications from research assistance to real-time monitoring and multi-step reasoning.
Graph RAG vs Vector RAG Benchmark
Vector RAG retrieves documents by semantic similarity. Graph RAG adds a knowledge graph on top of it, extracts entities and relationships from your documents, stores them in a graph database, and uses graph traversal alongside vector search at query time.
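To make the mechanics concrete, here is a minimal sketch of the query-time union Graph RAG performs, using toy in-memory stores (numpy for the dense index, networkx for the knowledge graph) in place of production vector and graph databases; the documents, entities, and vectors are all hypothetical.

    import numpy as np
    import networkx as nx

    # Toy dense index: doc id -> embedding (in practice, from an embedding model).
    doc_vecs = {"d1": np.array([0.9, 0.1]), "d2": np.array([0.2, 0.8])}

    # Toy knowledge graph of entities and relationships extracted from the docs.
    kg = nx.Graph()
    kg.add_edge("AcmeCorp", "JaneDoe", relation="CEO_of", source="d1")
    kg.add_edge("JaneDoe", "Boston", relation="based_in", source="d2")

    def vector_search(query_vec, k=1):
        # Cosine similarity against every stored vector (fine for a toy corpus).
        scores = {d: float(v @ query_vec / (np.linalg.norm(v) * np.linalg.norm(query_vec)))
                  for d, v in doc_vecs.items()}
        return sorted(scores, key=scores.get, reverse=True)[:k]

    def graph_expand(entity, hops=1):
        # Pull the n-hop neighborhood around an entity mentioned in the query.
        nodes = nx.single_source_shortest_path_length(kg, entity, cutoff=hops)
        return [(u, v, kg[u][v]["relation"]) for u, v in kg.edges(nodes)]

    # At query time, Graph RAG unions both retrieval paths into one context.
    query_vec = np.array([0.8, 0.2])
    context = {"chunks": vector_search(query_vec), "facts": graph_expand("JaneDoe")}
    print(context)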
RAG Observability Tools Benchmark
We benchmarked four RAG observability platforms on a 7-node LangGraph pipeline across three practical dimensions: latency overhead, integration effort, and platform trade-offs. For latency overhead, mean is the average latency across 150 measured graph.invoke() calls, median is the 50th percentile latency, and LLM-judge evaluations run only after the timer stops.
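The timing setup is easy to reproduce. A minimal sketch follows, with a dummy stand-in for the compiled LangGraph app so it runs anywhere; the real benchmark times 150 graph.invoke() calls on the 7-node pipeline and keeps judge evaluations outside the timed region, as noted above.

    import time
    import statistics

    class DummyGraph:
        # Stand-in for a compiled LangGraph app; swap in your own graph.
        def invoke(self, state):
            time.sleep(0.01)  # simulate pipeline work
            return state

    graph = DummyGraph()
    latencies = []
    for _ in range(150):                      # 150 measured calls, as in the benchmark
        start = time.perf_counter()
        graph.invoke({"question": "sample"})  # timer covers only the pipeline call
        latencies.append(time.perf_counter() - start)
    # Any LLM-judge evaluation would happen here, outside the timed region.

    print(f"mean   {statistics.mean(latencies) * 1000:.1f} ms")
    print(f"median {statistics.median(latencies) * 1000:.1f} ms")  # 50th percentile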
RAG Evaluation Tools: Weights & Biases vs Ragas vs DeepEval
When a RAG pipeline retrieves the wrong context, the LLM confidently generates the wrong answer. Context relevance scorers are the primary defense. We benchmarked five tools across 1,460 questions and 14,600+ scored contexts under identical conditions: same judge model (GPT-4o), default configurations, and no custom prompts.
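A context relevance scorer of this kind reduces to a judge prompt. Below is a minimal hand-rolled sketch using the OpenAI Python client with GPT-4o as the judge; the prompt wording and 0-1 scale are illustrative, not any benchmarked tool's actual template.

    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    def context_relevance(question: str, context: str) -> float:
        # Ask the judge model to rate how relevant the retrieved context is.
        prompt = (
            "Rate from 0.0 to 1.0 how relevant the context is to the question. "
            "Reply with only the number.\n\n"
            f"Question: {question}\nContext: {context}"
        )
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return float(resp.choices[0].message.content.strip())

    score = context_relevance("What is hybrid RAG?", "SPLADE produces sparse vectors.")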
Hybrid RAG: Boosting RAG Accuracy
Dense vector search is excellent at capturing semantic intent, but it often struggles with queries that demand high keyword accuracy. To quantify this gap, we benchmarked a standard dense-only retriever against a hybrid RAG system that incorporates SPLADE sparse vectors.
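One common way to combine the two signals is reciprocal rank fusion (RRF). A minimal sketch over toy dense and sparse rankings; the fusion method and document IDs here are illustrative, not necessarily the exact combination step used in the benchmark.

    def rrf(rankings, k=60):
        # Reciprocal rank fusion: each ranking contributes 1 / (k + rank) per doc.
        scores = {}
        for ranking in rankings:
            for rank, doc in enumerate(ranking, start=1):
                scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    dense_hits  = ["d3", "d1", "d7"]   # semantic ranking from the dense retriever
    sparse_hits = ["d7", "d2", "d3"]   # keyword-exact ranking from SPLADE vectors
    print(rrf([dense_hits, sparse_hits]))   # fused order, e.g. ['d3', 'd7', ...]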
Supervised Fine-Tuning vs Reinforcement Learning
Can large language models internalize decision rules that are never stated explicitly? To examine this, we designed an experiment in which a 14B parameter model was trained on a hidden “VIP override” rule within a credit decisioning task, without any prompt-level description of the rule itself.
Text-to-SQL: Comparison of LLM Accuracy
I have relied on SQL for data analysis for 18 years, beginning in my days as a consultant. Translating natural-language questions into SQL makes data more accessible, allowing anyone, even those without technical skills, to work directly with databases.
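Framed as code, the task is a single schema-grounded prompt. Here is a minimal text-to-SQL sketch using the OpenAI Python client; the schema, prompt, and model choice are illustrative rather than the article's exact setup.

    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    SCHEMA = "orders(id, customer_id, total, created_at), customers(id, name, country)"

    def to_sql(question: str) -> str:
        # Ground the model in the schema and ask for SQL only, no prose.
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"Schema: {SCHEMA}\nWrite one SQL query, no prose, for: {question}",
            }],
            temperature=0,
        )
        return resp.choices[0].message.content.strip()

    print(to_sql("Total revenue from German customers in 2024?"))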
Top 20+ Agentic RAG Frameworks
Agentic RAG enhances traditional RAG by boosting LLM performance and enabling greater specialization. We conducted a benchmark to assess its performance on routing between multiple databases and generating queries. Explore agentic RAG frameworks and libraries, key differences from standard RAG, benefits, and challenges to unlock their full potential.
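The routing behavior the benchmark measures can be sketched as one classification step. In the illustration below, a keyword rule stands in for the LLM call a real agentic framework would make; the database names and rules are hypothetical.

    def route(question: str) -> str:
        # A real agentic RAG system would ask an LLM to pick the data source;
        # this stand-in routes on keywords just to show the control flow.
        q = question.lower()
        if any(w in q for w in ("revenue", "invoice", "order")):
            return "sales_db"
        if any(w in q for w in ("ticket", "refund", "complaint")):
            return "support_db"
        return "docs_vector_store"

    def answer(question: str) -> str:
        source = route(question)
        # Each branch would generate a source-specific query (SQL, Cypher, or a
        # vector search) before handing retrieved context to the generator LLM.
        return f"[{source}] would handle: {question}"

    print(answer("How many refund tickets were opened last week?"))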
Embedding Models: OpenAI vs Gemini vs Cohere
The effectiveness of any Retrieval-Augmented Generation (RAG) system depends on the precision of its retriever. We benchmarked 11 leading text embedding models, including those from OpenAI, Gemini, Cohere, Snowflake, AWS, Mistral, and Voyage AI, using ~500,000 Amazon reviews.
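Retriever precision ultimately comes down to nearest-neighbor search over embeddings. A minimal sketch with hard-coded toy vectors standing in for a real embedding model's output on the review corpus:

    import numpy as np

    # In the benchmark, these vectors would come from an embedding model's API.
    corpus = {"rev1": np.array([0.9, 0.1, 0.0]),
              "rev2": np.array([0.1, 0.9, 0.2]),
              "rev3": np.array([0.4, 0.4, 0.8])}

    def top_k(query_vec, k=2):
        # Rank reviews by cosine similarity to the query embedding.
        q = query_vec / np.linalg.norm(query_vec)
        sims = {rid: float(v @ q / np.linalg.norm(v)) for rid, v in corpus.items()}
        return sorted(sims.items(), key=lambda kv: kv[1], reverse=True)[:k]

    print(top_k(np.array([0.8, 0.2, 0.1])))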