Ekrem Sarı
Ekrem is an AI Researcher at AIMultiple, focusing on intelligent automation, GPUs, AI Agents, and LLMOps for RAG frameworks.
Professional Experience
During his tenure as an Assessor at Yandex, he evaluated search results using proprietary frameworks and automated protocols. He implemented QA testing through data annotation, relevance scoring, and user intent mapping across 10,000+ queries monthly, while conducting technical assessments, including performance monitoring and spam detection using ML feedback loops.Research Interest
At AIMultiple, his research is centered on the MLOps lifecycle and the performance and benchmarking of end-to-end AI systems. He contributes to a wide range of projects, including Retrieval-Augmented Generation (RAG) optimization, extensive Large Language Model (LLM) benchmarking, and the design of agentic AI frameworks. Ekrem specializes in developing data-driven methodologies to measure and improve AI technology performance across critical operational metrics like accuracy, efficiency, API cost, and scalability.His analysis covers the entire technology stack, from foundational components like embedding models and vector databases to the high-performance GPU and cloud infrastructure required for deploying AI agents.
Education
Ekrem holds a bachelor's degree from Hacettepe Üniversitesi and a master's degree from Başkent Üniversitesi.Latest Articles from Ekrem
RAG Frameworks: LangChain vs LangGraph vs LlamaIndex
We benchmarked 5 RAG frameworks: LangChain, LangGraph, LlamaIndex, Haystack, and DSPy, by building the same agentic RAG workflow with standardized components: identical models (GPT-4.1-mini), embeddings (BGE-small), retriever (Qdrant), and tools (Tavily web search). This isolates each framework’s true overhead and token efficiency.
Multimodal Embedding Models: Apple vs Meta vs OpenAI
Multimodal embedding models excel at identifying objects but struggle with relationships. Current models struggle to distinguish “phone on a map” from “map on a phone.” We benchmarked 7 leading models across MS-COCO and Winoground to measure this specific limitation. To ensure a fair comparison, we evaluated every model under identical conditions using NVIDIA A40 hardware and bfloat16 precision.
Multi-GPU Benchmark: B200 vs H200 vs H100 vs MI300X
For over two decades, optimizing compute performance has been a cornerstone of my work. We benchmarked NVIDIA’s B200, H200, H100, and AMD’s MI300X to assess how well they scale for Large Language Model (LLM) inference. Using the vLLM framework with the meta-llama/Llama-3.1-8B-Instruct model, we ran tests on 1, 2, 4, and 8 GPUs.
Top Serverless Functions: Vercel vs Azure vs AWS
Serverless functions enable developers to run code without having to manage a server. This allows them to focus on writing and deploying applications while infrastructure scaling and maintenance are handled automatically in the background. In this benchmark, we evaluated 7 popular cloud service providers following our methodology to test their serverless function performance.
Top Bot Management Platforms
Bot management identifies real users, good and bad bots, safeguarding websites, APIs, and digital assets from automated threats.
AIMultiple Newsletter
1 free email per week with the latest B2B tech news & expert insights to accelerate your enterprise.