We benchmarked 15 English text-embedding models and a BM25 baseline on over 500 curated queries across three retrieval domains: legal contracts (CUAD), customer support (IBM TechQA), and healthcare (MedRAG PubMed).
Voyage-3.5 ranks first overall. Perplexity Embed V1 0.6b reaches the upper-mid tier at the lowest price point in our benchmark.
Embedding models benchmark results
Metrics explained
nDCG@3: Normalized discounted cumulative gain at cutoff 3. With one relevant document per query, it is 1 / log2(rank + 1) when the gold document lands in the top 3, and 0 otherwise. Rank 1 scores 1.000, rank 2 scores 0.631, rank 3 scores 0.500. We use nDCG@3 as the primary metric because production RAG pipelines feed the top 3 to 5 chunks to the LLM, and primacy bias makes rank 1 matter disproportionately.
nDCG@10: Same formula with cutoff 10.
Recall@10: Fraction of queries where the gold document appears in the top 10.
MRR@10: Mean reciprocal rank at cutoff 10. Gold at rank 1 scores 1.000, rank 2 scores 0.500, and rank 10 scores 0.100. Similar intent to nDCG@3 but with a steeper rank penalty.
Top-1 hit: Fraction of queries where the gold-relevant document is the single top result. The strictest metric and the one closest to a no-LLM lookup workflow.
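Because relevance is binary with exactly one gold document per query, every metric above reduces to a function of the gold document's rank. A minimal sketch of that reduction (illustrative Python, not our ranx-based eval code):

```python
import math

def single_gold_metrics(gold_rank):
    """Per-query metrics when there is exactly one relevant (gold) document.

    gold_rank is the 1-based rank of the gold document in the model's
    similarity-sorted list, or None if it was not retrieved at all.
    """
    def ndcg_at(k):
        # With a single gold document, ideal DCG is 1, so nDCG@k is just the
        # discounted gain of the gold document when it lands in the top k.
        if gold_rank is not None and gold_rank <= k:
            return 1.0 / math.log2(gold_rank + 1)
        return 0.0

    return {
        "ndcg@3": ndcg_at(3),        # rank 1 -> 1.000, rank 2 -> 0.631, rank 3 -> 0.500
        "ndcg@10": ndcg_at(10),
        "recall@10": 1.0 if gold_rank is not None and gold_rank <= 10 else 0.0,
        "mrr@10": 1.0 / gold_rank if gold_rank is not None and gold_rank <= 10 else 0.0,
        "top1_hit": 1.0 if gold_rank == 1 else 0.0,
    }

# Example: gold document retrieved at rank 2
print(single_gold_metrics(2))  # ndcg@3 = 0.631, mrr@10 = 0.5, recall@10 = 1.0, top1_hit = 0.0
```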
nDCG@3 by domain
Legal (CUAD, 246 queries, 509 contracts): Legal is the only domain where the specialist voyage-law-2 wins; its CUAD-tuned training data pays off by +0.040 nDCG@3 over voyage-4-large. openai/text-embedding-3-large ranks 11th at 0.6430, below eight cheaper models. BM25 floor: 0.5844.
Customer support (TechQA, 151 queries, 28,000 IBM technotes): The gap from voyage-4-lite to the next model is 0.018. gemini-embedding-001 drops to 7th (0.8856), 0.045 behind its newer sibling on TechQA even though it wins the other two domains. BM25 floor: 0.6097.
Healthcare (MedRAG-PubMed, 154 queries, 50,000 abstracts): Healthcare is the tightest cluster in our benchmark (14 models score above 0.88) because medical vocabulary is keyword-dense, which pushes most queries into the top cluster. BM25 floor: 0.7862, within 0.02 of the weakest dense model. gemini-embedding-001 also beats gemini-embedding-2-preview by its widest margin here (+0.013).
The domain-level flips justify the 3-domain-average framing: No single domain is a fair proxy for “which model is best,” and a buyer picking on one domain will misrank on the others.
Per-model 95% bootstrap confidence intervals for each domain cell, plus the four pairwise ties that point-estimate rankings hide, are detailed in the methodology section.
Accuracy vs price: Cost per 1M tokens
Metrics explained
Price per 1M input tokens is the list price for embedding 1M tokens of input, as of 2026-04-23. Voyage prices come from the Voyage direct pricing page. OpenRouter-served models use the OpenRouter catalog snapshot of the same day. Query and document tokens are priced at the same rate across every vendor tested. BM25 is plotted at $0.001/M only for log-axis rendering; as a self-hosted baseline, its true API cost is $0.
3-domain average nDCG@3 is the unweighted mean of per-domain nDCG@3 across the three corpora. Each domain contributes equally to the average regardless of query count.
- For cost-first RAG platforms, pplx-embed-v1-0.6b is the clear pick. At $0.004/M it is 30-50x cheaper than any of the commercial flagships and delivers 91% of voyage-3.5’s quality (0.8604 / 0.9429). No other model in our benchmark competes at its price point.
- For quality-first enterprise RAG, voyage-3.5 via the Voyage direct SDK takes the top Pareto point. You trade one extra API integration (versus an OpenRouter-only stack) for a marginally better model than Voyage’s own flagship at half the price. The “always pick the newest and biggest” instinct is wrong inside Voyage’s catalog.
- For OSS / self-hostable / on-prem deployments, qwen3-embedding-8b wins. At $0.010/M it is the cheapest self-hostable model in our top 10, matches or beats every other OSS encoder family we tested, and ships with weights you can run on-prem.
- The premium flagships (openai-3-large, gemini-2-preview, voyage-4-large, gemini-001) all lose to voyage-3.5 on the 3-domain average, even though voyage-3.5 is 2-3x cheaper than any of them.
Key findings from the embedding benchmark
voyage-3.5 wins the 3-domain average and beats the voyage-4-large flagship at half the price
voyage-3.5 averages 0.9429 nDCG@3 across legal, customer support, and healthcare. The flagship voyage-4-large averages 0.9416 at $0.12 per 1M tokens, 2x the $0.06 voyage-3.5 price. The flagship wins TechQA by 0.002 and wins MedRAG by 0.032. It loses CUAD by 0.037 (0.8730 vs 0.9102), enough that its 3-domain average sits below voyage-3.5. Inside Voyage’s lineup, the older mid-tier model is the better general-purpose pick. The flagship only earns its premium on healthcare.
Voyage takes the top spot in all three domains and sweeps the top two on CUAD and TechQA. On MedRAG, gemini-embedding-001 breaks into 2nd place (0.9814, behind voyage-4-large’s 0.9855), ahead of every other Voyage model. gemini-001 also reaches third on CUAD. No other non-Voyage model reaches the top two on any single domain.
A legacy Gemini model beats its newer “preview” sibling on two of three domains
google/gemini-embedding-001 (released June 2025) outperforms google/gemini-embedding-2-preview on both CUAD (0.8980 vs 0.8958) and MedRAG (0.9814 vs 0.9685). The newer model wins only TechQA (0.9301 vs 0.8856), a 0.04 gap that comes with a 33% price increase ($0.20 vs $0.15 per 1M input tokens). The “newer multimodal upgrade” framing of Gemini 2 does not hold up in English text retrieval on legal or healthcare corpora.
For RAG workloads on those two domains today, gemini-embedding-001 is the correct Gemini pick. The flip in MedRAG (001 at 2nd, 2-preview at 3rd) is large enough that a buyer defaulting to the “newest” model loses measurable quality.
OpenAI text-embedding-3-large is mid-tier in legal and customer support
openai/text-embedding-3-large ranks 11th of 15 dense models on CUAD at 0.6430 nDCG@3. Eight strictly cheaper models beat it on legal contracts: the two $0.12 Voyage models (voyage-4-large and voyage-law-2), voyage-3.5 at half the price, voyage-4-lite at 1/6 the price, both Qwen3 embedding variants, intfloat/e5-large-v2 at 1/13 the price, and perplexity/pplx-embed-v1-0.6b (0.8031) at 1/32 the price. OpenAI’s flagship is 9th on TechQA (0.8581) and 11th on MedRAG (0.9296). Healthcare puts it in a tight top cluster (spread from 2nd to 11th: 0.05 nDCG@3). On legal the gap is wide and expensive.
At $0.13 per 1M input tokens it is 32x more expensive than pplx-embed-v1-0.6b. Teams defaulting to OpenAI because “it’s the safe choice” are paying a premium that the 3-domain data does not justify.
pplx-embed-v1-0.6b reaches the top tier at one-thirtieth the price of comparable flagships
perplexity/pplx-embed-v1-0.6b at $0.004 per 1M tokens averages 0.8604 nDCG@3 across the three domains, behind only the four Voyage models, the two Gemini variants, and qwen/qwen3-embedding-8b. It beats every OpenAI model in the lineup and every OSS model except qwen3-embedding-8b. It also beats openai/text-embedding-3-large by 0.16 nDCG@3 on CUAD, loses by 0.012 on TechQA (0.8457 vs 0.8581), and wins by 0.003 on MedRAG. The next cheapest top-10 model is qwen/qwen3-embedding-8b at $0.010 (2.5x more), also served over OpenRouter.
For cost-first RAG platforms where embedding is a material line item, pplx-0.6b is the clear choice. On these three domains, the 30-50x gap to flagship pricing buys essentially nothing in retrieval quality.
BM25 is within 0.02 of the weakest dense model on medical abstracts
On MedRAG-PubMed, BM25 scores 0.7862 nDCG@3 against baai/bge-m3 (dense mode) at 0.8038, a 0.02 gap. Lexical search comes within 0.15 of seven of the fifteen dense models on this corpus (bge-m3, e5-base-v2, openai-3-small, e5-large-v2, openai-3-large, pplx-0.6b, qwen3-4b). The reason is structural: medical queries are keyword-dense by design (drug names, disease names, study-design terms, gene symbols), and those tokens carry most of the retrieval signal. A Lucene-style scorer matches them directly without needing semantic context.
A reranker on top of BM25 is a plausible, cheaper alternative to a premium dense embedder for keyword-dense corpora: the retrieval gap BM25 leaves (0.2 nDCG@3 to the top tier on MedRAG) is the kind of gap a Cohere or Voyage reranker can close. On CUAD the gap from BM25 to the best dense model is 0.33, on TechQA 0.36, on MedRAG 0.20. Domain-vocabulary density is the single biggest determinant of how much dense embeddings help.
Domain specialists vs generalists across vendors
Voyage prices voyage-law-2 at $0.12/M, identical to voyage-4-large. The two models share vendor, tokenizer, SDK, and asymmetric invocation scheme. Only the training data emphasis differs. Running both against generalists across CUAD, TechQA, and MedRAG isolates the effect of the legal training.
On CUAD, voyage-law-2 ranks 1st at 0.9126: 0.0024 above voyage-3.5, 0.0146 above gemini-embedding-001, 0.040 above voyage-4-large, 0.097 above qwen3-embedding-8b, and 0.270 above openai/text-embedding-3-large (0.6430). On TechQA, voyage-law-2 ranks 4th at 0.9020, 0.064 behind voyage-4-large and 0.063 behind voyage-3.5. On MedRAG, it ranks 6th at 0.9409, 0.045 behind voyage-4-large and 0.041 behind gemini-embedding-001. The legal training raises nDCG@3 on CUAD and lowers it on the other two domains.
A legal team shipping CUAD-style retrieval on openai/text-embedding-3-large runs at 0.6430 nDCG@3 against voyage-law-2 at 0.9126, a 0.27 gap. A healthcare or support team that picks voyage-law-2 because it ranked first on CUAD loses 0.045 to voyage-4-large on MedRAG and 0.064 on TechQA. Domain-specialist embedding models are not drop-in upgrades for generic retrieval. A single “best model” recommendation across industries picks wrong in at least one direction.
When to pick voyage-law-2: contract-retrieval on commercial legal corpora that structurally resemble CUAD. When not to: anything else in this benchmark. voyage-3.5 is $0.06/M, lands 0.0024 below voyage-law-2 on CUAD, and outperforms it on both TechQA and MedRAG.
How the embedding retrieval pipeline was evaluated
Each model is used as a bi-encoder: it encodes one query vector and N document vectors. We compute cosine similarity between the query vector and every document vector, then rank documents by similarity and keep the top-k for that query. With one gold document per query and binary relevance, the evaluator checks whether the gold appears in the top-k and at what rank. That rank feeds into nDCG@3 (our primary), nDCG@10 (for BEIR/MTEB comparability), Recall@10, and Top-1 hit rate.
Query and document encoders are not always the same function. Some models are trained asymmetrically: the query side applies one transform, the document side applies another. Invoking those models symmetrically (“just pass the text in”) silently degrades retrieval quality by 0.05-0.45 nDCG@10. Our lineup splits four ways:
Why nDCG@3 as primary. Production RAG pipelines feed the top 3 to 5 chunks to the LLM, not the top 10. Primacy bias in long-context LLMs makes rank 1 matter more than rank 3, and each distractor that lands above the gold in the LLM context is a candidate for confabulation. Rerankers would flatten this effect, but most production RAGs run without one for cost and latency reasons, so the embedder’s rank IS the final rank.
On MedRAG, Recall@10 hit the 1.000 ceiling for three Voyage models and for qwen3-8b; nDCG@3 preserved a 0.10 spread on the same queries. nDCG@10 retains BEIR comparability but softens the top-of-list differences that matter operationally.
Embedding models benchmark methodology
Corpora (domain selection + why)
We chose three domains that stress different retrieval properties and cover three of the most common enterprise RAG verticals. Every corpus is SHA256-pinned, so any reader can reproduce the exact cells we ran against the same corpus snapshot.
PM209 (manufacturing manuals) was dropped: only 209 documents, too small to prevent the BM25 entity-shortcut issue at 150 queries.
Query generation: 3-LLM consensus protocol
Our queries are LLM-generated under writer-validator separation: the LLM that drafts a query never judges its own retrieval target, so self-bias is structurally excluded. Only the two non-writer validators, seeing the 20 candidates shuffled with no hint about which was the writer’s grounding document, decide acceptance. On top of the LLM consensus, we spot-reviewed roughly 25% of the accepted query set by hand (author review of query naturalness, target-document alignment, and R9 compliance, independent of the validator vote).
Every query passed the following pipeline before entering the production set:
- Writer drafts a single query grounded in one randomly sampled document. Writer rotates across Claude Sonnet 4.6, Qwen3.6-plus, and Gemini 3 Flash preview so no single model dominates the linguistic fingerprint.
- Scorer (fixed: Claude Sonnet 4.6) rates the query on a specificity rubric. We required semantic_bridge ≥ 4 (the query must semantically describe what the document asserts, not just name-match) and unique_referent in 3-5 (the descriptive anchors must identify roughly one to five candidate documents in the corpus, not thousands or exactly one).
- Hard-negative check: We pull the BM25 top-19 distractor documents plus the target and run a near-duplicate Jaccard gate (>0.5 → reject the whole query as ambiguous ground truth); a sketch of the gate follows this list.
- Validators (2 models, never including the writer) independently pick the target document from the shuffled 20-candidate set. Both validators must agree on the exact target slot or the query is discarded. “None of the above” and “multiple correct answers” are valid validator responses and also discard the query.
- Cohen’s kappa computed per validator pair. Each query had exactly 2 raters (the non-writers from the 3-model pool), so the 3 possible writer-excluded pairs give us 3 separate kappa values per domain. We report them individually and as an n-weighted mean.
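Before moving to the agreement statistics, here is a minimal sketch of the near-duplicate gate from the hard-negative step above, assuming token-set Jaccard over whitespace tokens (the tokenization in our actual pipeline may differ):

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two documents."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def passes_near_duplicate_gate(target_doc: str, bm25_distractors: list[str],
                               threshold: float = 0.5) -> bool:
    """Reject the query if any BM25 top-19 distractor is a near-duplicate of the
    target document: near-duplicates make the ground truth ambiguous."""
    return all(jaccard(target_doc, d) <= threshold for d in bm25_distractors)
```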
Per-pair Cohen’s kappa κ = (p_o − p_e) / (1 − p_e), with observed agreement p_o and expected-by-chance agreement p_e, computed on accepted queries plus all consensus_fail rejects where both validators reached a decision. Cells show n / p_o / p_e / κ:
The n-weighted mean is a descriptive summary, not an inferential statistic. It condenses the three per-pair kappas into one number weighted by how many queries each pair judged; it is not itself a kappa value for a pooled dataset, and CI on it would need to be computed via bootstrap resampling at the query level (deferred to v2.1).
We used Cohen’s kappa (not Fleiss’ kappa or Krippendorff’s alpha) because every query had exactly 2 raters: the natural framing here is 3 pairwise Cohen calculations since we want to know whether any two specific models agree, not whether a panel of 3 raters coheres. Krippendorff’s alpha would give a single number but would mix the three pairs together and hide the pair-level variance.
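For reference, a sketch of the per-pair computation, assuming each validator's pick is stored as one label per query (a candidate slot, "none of the above", or "multiple correct"):

```python
from collections import Counter

def cohens_kappa(picks_a: list[str], picks_b: list[str]) -> float:
    """Cohen's kappa for two raters labeling the same set of queries.

    picks_a[i] / picks_b[i] are validator A's and B's chosen labels for query i
    (e.g. "doc_07", "none_of_the_above", "multiple_correct").
    """
    n = len(picks_a)
    assert n == len(picks_b) and n > 0

    # Observed agreement: fraction of queries where both validators chose the same label.
    p_o = sum(a == b for a, b in zip(picks_a, picks_b)) / n

    # Expected-by-chance agreement from each rater's marginal label distribution.
    freq_a, freq_b = Counter(picks_a), Counter(picks_b)
    labels = set(freq_a) | set(freq_b)
    p_e = sum((freq_a[lab] / n) * (freq_b[lab] / n) for lab in labels)

    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
```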
CUAD specifically: claude × qwen reaches κ=0.974 while claude × gemini and gemini × qwen sit around κ=0.86, which isolates Gemini-3-flash-preview as the noisiest judge on legal contracts. That information is a methodology signal worth surfacing, not averaging away.
We promoted a domain to production after the n-weighted-mean kappa cleared 0.85. All three cleared it. MedRAG’s 0.986 is effectively ceiling: the two disagreements across 156 attempts were on medically ambiguous targets where both validators were internally consistent but one picked a related-but-non-gold abstract.
R9 entity-anonymization ruleset (per-domain)
R9 is a hard constraint at query-generation time. Without it, BM25 rises above 0.97 nDCG@10 because named entities act as perfect keyword shortcuts; dense embeddings have no semantic edge to measure. The rule is tailored per domain so the anchors that actually carry retrieval signal in that domain remain usable:
- CUAD strict. Ban all named entities: party names, US state names, personnel, monetary amounts in exact dollars, specific product names. Force descriptive uniqueness: industry + role + temporal era + monetary range + geographic scope. BM25 ceiling dropped from 0.97 to 0.591 after R9 was enforced.
- TechQA Option X. IBM product names allowed (they are the primary retrieval signal for a sysadmin) IF the query also contains a secondary non-product descriptive anchor (symptom class, error code family, version era, deployment context). Customer names, US states, personnel still banned. BM25 ceiling: 0.664.
- MedRAG medical-relaxed + hallucination-safe. Drug names, disease terms, anatomy, gene symbols retained verbatim from the source because substituting drug-class labels risks pharmacology hallucination (“p-chloroamphetamine” is an amphetamine-class serotonin releaser, but LLM label translations of rarer drugs fail silently). Query must contain ≥2 non-drug anchors so pure-drug-name keyword match does not carry the result. BM25 ceiling: 0.809 (structural property of the domain, not a methodology flaw).
Example queries
For each example, the query is the text we feed the embedding model. The gold document is the single item in the corpus (out of 509 CUAD contracts, 28,000 TechQA technotes, or 50,000 PubMed abstracts) that actually answers the query. The retrieval task is: embed the query, compute cosine similarity against every document in the corpus, and rank them. If the gold document lands at rank 1 the query scores 1.000 on nDCG@3; rank 2 scores 0.631; rank 3 scores 0.500; below top-3 scores 0.
CUAD (legal)
Query:
Gold document (1 of 509 CUAD contracts): ANIXABIOSCIENCESINC_06_09_2020-EX-10.1-COLLABORATION AGREEMENT. It is a 2020 collaboration between a German firm and a US biotech for COVID-19 drug discovery; the contract specifies a milestone payment due when the first patient enters Phase I of a clinical trial. The query contains no party names, no monetary amounts, and no geographies beyond two country tokens; the retrieval signal is industry + temporal + milestone structure.
TechQA (customer support)
Query:
Gold document (1 of 28,000 IBM technotes): swg1IY43185, which documents exactly that WebSEAL bug and names the patch that fixes it. IBM product name (WebSEAL) is allowed under our TechQA R9 variant, but the discriminator is the behavioral pattern of the bug and the request-ordering anchor, not the product name alone.
MedRAG (healthcare)
Query:
Gold document (1 of 50,000 PubMed abstracts): PMID:231299, a clinical trial comparing adverse-reaction dropout rates between cephradine and pivmecillinam in pregnant women with urinary tract infections. Drug names are retained because the drug-versus-drug comparison is the retrieval signal, but the query adds patient population + treatment duration + adverse-event framing so a pure drug-name BM25 match does not land the target on its own.
Statistical protocol
Bootstrap 95% confidence intervals use 10,000 resamples, percentile method, seed=2026 on the per-query metric vector. Pairwise significance between model A and model B uses a paired bootstrap over the same query indices; a superiority claim requires A > B in ≥95% of resamples.
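The logic, sketched under those exact settings (the canonical implementation lives in scripts/bootstrap_ci.py; this is an illustration only):

```python
import numpy as np

def bootstrap_ci(per_query: np.ndarray, n_resamples: int = 10_000,
                 seed: int = 2026, alpha: float = 0.05) -> tuple[float, float]:
    """Percentile bootstrap CI over a per-query metric vector (e.g. nDCG@3)."""
    rng = np.random.default_rng(seed)
    n = len(per_query)
    idx = rng.integers(0, n, size=(n_resamples, n))      # resample query indices
    means = per_query[idx].mean(axis=1)
    return float(np.quantile(means, alpha / 2)), float(np.quantile(means, 1 - alpha / 2))

def paired_superiority(a: np.ndarray, b: np.ndarray, n_resamples: int = 10_000,
                       seed: int = 2026) -> bool:
    """Paired bootstrap: model A beats model B only if A's mean exceeds B's in
    >=95% of resamples drawn over the *same* query indices for both models."""
    rng = np.random.default_rng(seed)
    n = len(a)
    idx = rng.integers(0, n, size=(n_resamples, n))
    wins = (a[idx].mean(axis=1) > b[idx].mean(axis=1)).mean()
    return wins >= 0.95
```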
Single run per (model, domain) cell. A 3-run cross-session variance layer is deferred to v2.1 for cost reasons. Within-session embedding API calls are deterministic within a few parts per million cosine difference, verified on spot-check; bootstrap CI therefore captures query-level noise which is the dominant variance source at n=150-246.
Indexing and scoring
No vector database. Each model encodes every corpus document once; cosine similarity is computed directly in NumPy as a dense matrix product of L2-normalized embeddings. This is exact, not approximate, so rank ties are genuine model ties and not ANN artifacts.
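In code, exact scoring reduces to one normalized matrix product per corpus. A minimal sketch with hypothetical variable names (not our harness):

```python
import numpy as np

def rank_gold(query_vecs: np.ndarray, doc_vecs: np.ndarray,
              gold_idx: np.ndarray) -> np.ndarray:
    """Exact cosine retrieval: returns the 1-based rank of each query's gold document.

    query_vecs: (Q, d) query embeddings; doc_vecs: (N, d) document embeddings;
    gold_idx: (Q,) index of the gold document for each query.
    """
    # L2-normalize so the dot product equals cosine similarity.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)

    sims = q @ d.T                                   # (Q, N) exact scores, no ANN index
    gold_sims = sims[np.arange(len(q)), gold_idx]    # similarity of each query's gold doc
    # Rank = 1 + number of documents scoring strictly higher than the gold doc.
    return 1 + (sims > gold_sims[:, None]).sum(axis=1)
```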
Per-model chunking rule: models with a 512-token context chunk at 512 tokens with 64-token overlap; 8K-20K-context models chunk to their full context with no overlap; 32K+-context models ingest the full document when it fits. (CUAD’s 9% long-tail exceeds every non-Nemotron context window and falls back to chunking; cross-model fairness is preserved by applying the same per-context-size policy to every model.)
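A sketch of that per-context-size policy, assuming the document is already tokenized with the model's own tokenizer (function and variable names are hypothetical):

```python
def chunk_tokens(doc_tokens: list[int], model_ctx: int) -> list[list[int]]:
    """Chunk a tokenized document according to the per-context-size policy:
    512-ctx models  -> 512-token chunks with 64-token overlap;
    8K-20K-ctx models -> context-sized chunks, no overlap;
    32K+ models     -> whole document when it fits, else context-sized chunks."""
    if not doc_tokens:
        return []
    if model_ctx >= 32_000 and len(doc_tokens) <= model_ctx:
        return [doc_tokens]                   # whole-document ingestion
    if model_ctx <= 512:
        size, stride = 512, 512 - 64          # 512-token chunks, 64-token overlap
    else:
        size, stride = model_ctx, model_ctx   # chunk to context, no overlap
    return [doc_tokens[i:i + size] for i in range(0, len(doc_tokens), stride)]
```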
Per-model asymmetric-retrieval invocation is the single most impactful methodology detail and deserves a dedicated section. It is the reason gemini-embedding-2-preview scores 0.46 nDCG@10 under OpenRouter’s documented code sample versus 0.91 under Google’s Vertex AI format. See “How the embedding retrieval pipeline was evaluated” above for the per-family table.
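Two illustrative invocation patterns, as a hedged sketch: the e5 model cards document literal "query: " / "passage: " text prefixes, and the Voyage Python SDK exposes an input_type parameter; exact parameter names should be checked against each vendor's current docs, and the query and document strings below are placeholders.

```python
import voyageai

user_query = "milestone payment due when the first patient enters a Phase I trial"  # illustrative
docs = ["<contract text 1>", "<contract text 2>"]                                   # illustrative

# Pattern 1: prefix-based asymmetry (e5 family). The query side and the document
# side receive different literal text prefixes before encoding.
e5_query_input = "query: " + user_query
e5_doc_inputs = ["passage: " + d for d in docs]

# Pattern 2: parameter-based asymmetry (Voyage family via the voyageai SDK).
# The same endpoint is called with an input_type flag instead of a text prefix.
vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
q_emb = vo.embed([user_query], model="voyage-3.5", input_type="query").embeddings[0]
d_embs = vo.embed(docs, model="voyage-3.5", input_type="document").embeddings
```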
Eval framework: ranx as primary metric engine; trec_eval-style output compatible with MTEB leaderboard submissions. Bootstrap CIs are computed by scripts/bootstrap_ci.py over the per-query metric arrays saved during the eval pass.
Models tested
Prices are as of 2026-04-23 from the OpenRouter catalog and the Voyage direct pricing page.
Per-model nDCG@3 with 95% bootstrap CI
Bootstrap 95% confidence intervals computed via 10,000 resamples of the per-query metric vector (percentile method, seed=2026). CI widths of 0.03-0.07 at these sample sizes (n=154-246) mean that point-estimate gaps below ~0.03 are within noise and should be treated as ties. Sorted by 3-domain average nDCG@3:
Four statistical ties where point-estimate rankings are not significant at 95% CI:
Limitations
Human review limited to one author: roughly 25% of the final accepted queries were spot-reviewed for naturalness, target alignment, and R9 compliance.
Conclusion
voyage-3.5 averages 0.9429 nDCG@3 across legal, customer support, and healthcare, beating Voyage’s own flagship at half the price and OpenAI’s text-embedding-3-large by 0.13 nDCG@3 at less than half the price.
Pick pplx-embed-v1-0.6b at $0.004/M if the embedding cost has to be a rounding error. Pick voyage-3.5 at $0.060/M for the top Pareto point. Pick qwen/qwen3-embedding-8b at $0.010/M to stay OSS. Use voyage-law-2 only for CUAD-adjacent legal retrieval, where it buys +0.04 nDCG@3 on CUAD and nothing elsewhere.
Further reading
Explore other RAG benchmarks, such as:
- Top 10 Multilingual Embedding Models for RAG
- Top 16 Open Source Embedding Models for RAG
- Top Vector Database for RAG: Qdrant vs Weaviate vs Pinecone
- Reranker Benchmark: Top 8 Models Compared
- Multimodal Embedding Models: Apple vs Meta vs OpenAI
- Hybrid RAG: Boosting RAG Accuracy
- Graph RAG vs Vector RAG