Nazlı Şipi
Nazlı is part of the benchmark team, focusing on large language models (LLMs), AI agents, and agentic frameworks.
Nazlı holds a Master’s degree in Business Analytics from the University of Denver.
Latest Articles from Nazlı
Top 5 Open-Source Agentic AI Frameworks in 2026
We benchmarked four popular open-source agentic frameworks across five different tasks, running each task 100 times per framework. We examined how the frameworks themselves influence agent behavior and the resulting impact on latency and token consumption.
Vision Language Models Compared to Image Recognition
Can advanced Vision Language Models (VLMs) replace traditional image recognition models? To find out, we benchmarked 16 leading models across three paradigms: traditional CNNs (ResNet, EfficientNet), VLMs (such as GPT-4.1, Gemini 2.5), and cloud APIs (AWS, Google, Azure).
Agentic LLM Benchmark: Top 13 LLMs Compared
We benchmarked 13 LLMs across 10 software development tasks using an agentic CLI tool, executing ~300 automated validation steps per model to measure performance across both API and UI layers.
Compare Multimodal AI Models on Visual Reasoning
We benchmarked 15 leading multimodal AI models on visual reasoning using 200 visual-based questions. The evaluation consisted of two tracks: 100 chart understanding questions testing data visualization interpretation, and 100 visual logic questions assessing pattern recognition and spatial reasoning. Each question was run 5 times to ensure consistent and reliable results.
LLM Observability Tools: Weights & Biases, LangSmith
LLM-based applications are becoming more capable and increasingly complex, making their behavior harder to interpret. Each model output results from prompts, tool interactions, retrieval steps, and probabilistic reasoning that cannot be directly inspected. LLM observability addresses this challenge by providing continuous visibility into how models operate in real-world conditions.
AI Hallucination Detection Tools: W&B Weave & Comet
We benchmarked three hallucination detection tools across 100 test cases: Weights & Biases (W&B) Weave HallucinationFree Scorer, Arize Phoenix HallucinationEvaluator, and Comet Opik Hallucination Metric. Each tool was evaluated on accuracy, precision, recall, and latency to provide a fair comparison of real-world performance.
Benchmarking Agentic AI Frameworks in Analytics Workflows
Frameworks for building agentic workflows differ substantially in how they handle decisions and errors, yet their performance on imperfect real-world data remains largely untested.
Top 9 AI Providers Compared
The AI infrastructure ecosystem is growing rapidly, with providers offering diverse approaches to building, hosting, and accelerating models. While they all aim to power AI applications, each focuses on a different layer of the stack.
LLM Latency Benchmark by Use Cases in 2026
The effectiveness of large language models (LLMs) is determined not only by their accuracy and capabilities but also by how quickly they respond to users. We benchmarked the performance of leading language models across various use cases, measuring their response times to user input.