40,000 Engineering Hours / Year to Test AI and Enterprise Software
Explore our investment in benchmarking to create a realistic test environment for different B2B tech solutions
Approach
Benchmarking is hard. Every business has different needs that cannot be perfectly simulated outside of those companies. Our benchmarking approach relies on these pillars:
- Continuous improvement: As products mature, their use cases evolve, and so does the way we run our benchmarks.
- Equal access: Every brand and our readers have access to the same data points about our tests.
- Relevant use cases: There are infinitely many ways to use each tech solution. We strive to produce benchmarks that are as realistic as possible by:
- Becoming long-term users of the products we evaluate
- Interviewing experts
- Analyzing case studies and reviews to understand other users' experiences
- Transparency:
- We follow the scientific method and publish our methodology, including the timing of the benchmark, along with every benchmark. Our aim is to help others understand what we measured and reproduce our findings if they wish.
- We would like to publish test data in every benchmark. However, this can lead to data poisoning, with certain products performing better on the test data than in reality. To avoid this, most of our tests are completed with holdout datasets. We strive to complement the holdout datasets with open-source datasets whenever we can.
- Reproducibility: Performance fluctuates over time. Therefore, for each metric, we complete multiple measurements over time. In cases where we have not achieved that, we highlight this issue as part of the benchmark (see the illustrative sketch after this list).
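To make the holdout and repeated-measurement pillars concrete, here is a minimal sketch of how a benchmark run can be structured. The `HoldoutCase` records, the `product` callable, the exact-match scoring, and the three-run default are illustrative assumptions rather than our production tooling; each benchmark defines its own metrics and scoring rules.

```python
import statistics
from dataclasses import dataclass
from typing import Callable

@dataclass
class HoldoutCase:
    """One test case from a private holdout set, kept unpublished to avoid data poisoning."""
    prompt: str
    expected: str

def run_benchmark(
    product: Callable[[str], str],   # hypothetical interface to the system under test
    holdout: list[HoldoutCase],      # private holdout dataset
    runs: int = 3,                   # repeated measurements, e.g. spread over time
) -> dict:
    """Score a product on the holdout set across several runs and aggregate the results."""
    per_run_accuracy = []
    for _ in range(runs):
        correct = sum(1 for case in holdout
                      if product(case.prompt).strip() == case.expected)
        per_run_accuracy.append(correct / len(holdout))
    return {
        "mean_accuracy": statistics.mean(per_run_accuracy),
        "stdev": statistics.stdev(per_run_accuracy) if runs > 1 else 0.0,
        "per_run": per_run_accuracy,
    }
```

Reporting per-run scores alongside the mean keeps run-to-run fluctuation visible, which is why metrics measured only once are flagged in the benchmark.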
Benchmarks by the Numbers
AI:
- Hallucination rates of more than a dozen LLMs are ranked.
- More than 10 agentic RAG solutions are rated in terms of their correct database selection rates.
- Top 10 LLMs are scored in terms of their accuracy in SQL code generation (an illustrative scoring sketch follows this section).
- All hyperscalers' AI image recognition solutions benchmarked using 100 images.
- Top AI avatar software compared across 10+ dimensions.
Web data:
- Proxies: Sent 6 million web page requests for the load test as part of our enterprise-scale web data collection benchmark.
- Web scraping APIs: Tested more than 40 web scraping APIs on a range of websites, including e-commerce platforms and search engines.
Application security: 10 web scans analyzed for our DAST benchmark.
Data security: 5 DLP tools evaluated across 10+ metrics in our DLP benchmark.
IT automation:
- 3 vendors compared on data transfer rates across 5 regions as part of our managed file transfer benchmark.
- 7 products compared across 8 metrics in our RMM benchmark.
- Top vendors evaluated across 10+ metrics in our ITSM benchmark.
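As an example of how an accuracy metric such as SQL code generation can be scored, the sketch below compares the rows returned by a model-generated query against those of a reference query on a SQLite database. The execution-match rule, the `db_path`, and the case format are assumptions made for illustration; they are not the exact scoring harness behind the benchmark above.

```python
import sqlite3

def execution_match(db_path: str, generated_sql: str, reference_sql: str) -> bool:
    """True if the generated query returns the same rows as the reference query.
    Row order is ignored; queries that fail to execute count as incorrect."""
    with sqlite3.connect(db_path) as conn:
        try:
            generated_rows = conn.execute(generated_sql).fetchall()
        except sqlite3.Error:
            return False
        reference_rows = conn.execute(reference_sql).fetchall()
    return sorted(map(repr, generated_rows)) == sorted(map(repr, reference_rows))

def sql_accuracy(db_path: str, cases: list[tuple[str, str]]) -> float:
    """Fraction of (generated_sql, reference_sql) pairs whose results match."""
    return sum(execution_match(db_path, gen, ref) for gen, ref in cases) / len(cases)
```

Execution match is stricter than comparing SQL strings: two syntactically different queries that return the same result set still count as correct, which is closer to how an analyst would judge the output.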
Behind Our Benchmarks
AIMultiple's industry analysts work with our network of business experts and our principal analyst to write and update AIMultiple articles.
AIMultiple Academy
We have launched AIMultiple Academy as a structured training program designed to elevate our team's technical capabilities. Our CTO leads these hands-on sessions, combining theoretical instruction with practical assignments that provide real-world experience. Through this initiative, we're transforming our analysts into AI-empowered builders who can confidently evaluate and benchmark complex products. This technical upskilling represents a strategic investment in our team's ability to deliver more thorough, insightful product reviews and benchmarks.
So why don't we just vibe code our benchmarks?
- Consistency over time: Our benchmarks need to be run repeatedly to measure changes in performance. Even though modern AI coding tools like Cursor and Windsurf can help create functional MVPs, deploying these applications still requires developer knowledge that goes beyond generating code. Without proper DevOps and infrastructure expertise, teams struggle to move from a prototype to a production environment.
- Security: AI-generated code without proper review and understanding leaves systems vulnerable to security exploits. Our training emphasizes identifying and mitigating these potential attack vectors to ensure benchmarks remain secure and reliable.
- Understanding: While AI can generate code, our analysts still need fundamental software knowledge to interpret these benchmarks accurately.
Common Confidence Intervals
Since we run a limited number of tests, we calculate confidence intervals for each metric, and we used 95% confidence intervals across the report.
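As an illustration, a common choice (used here as an assumption, not necessarily the exact formula behind every benchmark) for a metric reported as a success rate p̂ measured over n independent tests is the 95% normal-approximation interval:

```latex
\hat{p} \pm z_{0.975}\sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}, \qquad z_{0.975} \approx 1.96
```

For example, 90 correct answers out of 100 tests (p̂ = 0.9, n = 100) gives roughly 0.9 ± 0.06, i.e. an interval of about 84% to 96%. The computation applied to a specific metric is part of the methodology published with that benchmark.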
Participants
Given time and resource constraints, we typically run benchmarks with the largest vendors in a specific domain. Metrics like number of employees help us identify the largest brands. The specific criteria used to identify the products to be benchmarked are explained in each benchmark.
We thank the hundreds of brands that provide us access to their products, either through credits or generous free trial periods, allowing us to benchmark their solutions.
Rarely, some brands choose not to participate in some of our benchmarks. In such cases, we rely on public data to evaluate their products.
Why Benchmarking Matters in B2B Tech
Transparent, data-driven benchmarks of product performance are rare. Legacy industry analysts rely on opaque and potentially biased assessments, publishing only:
- High-level qualitative (e.g. market understanding) and quantitative criteria that products are evaluated against
- High-level assessments against these criteria, without disclosing the values driving the assessment
These assessments also suffer from:
- Analyst bias: Analysts evaluate vendor representatives’ responses, including qualitative responses. Vendor representatives with commercial relationships with the industry analyst have the chance to build rapport by scheduling calls throughout the year, while vendor representatives without such relationships may only present their product over a single call.
- Conflict of interest: For these assessments, vendor representatives are asked about their private data (e.g. revenues, features, roadmap). Since it is clear which responses lead to better outcomes for the vendor (e.g. higher product revenues are likely to result in a higher rank), vendor representatives face a conflict of interest.
Enterprises can make better technology decisions after reviewing objective and data-driven benchmarks.