We benchmarked 14 open-source embedding models, self-hosted on a single H100, across 500+ manually curated retrieval queries spanning legal contracts, customer-support technotes, and medical abstracts. NVIDIA Llama-Embed-Nemotron-8B leads in accuracy. On cost, Google’s EmbeddingGemma-300m comes in roughly 4x cheaper than Nemotron, with a small accuracy loss.
Open source embedding models benchmark results
Metrics explained
nDCG@3: Normalized discounted cumulative gain at cutoff 3. With one relevant document per query, it is 1 / log2(rank + 1) when the gold document lands in the top 3, and 0 otherwise. Rank 1 scores 1.000, rank 2 scores 0.631, and rank 3 scores 0.500. We use nDCG@3 as the primary metric because production RAG pipelines feed the top 3 to 5 chunks to the LLM, and primacy bias makes rank 1 matter disproportionately.
nDCG@10: Same formula with cutoff 10.
Recall@10: Fraction of queries where the gold document appears in the top 10.
MRR@10: Mean reciprocal rank at cutoff 10. Gold at rank 1 scores 1.000, rank 2 scores 0.500, and rank 10 scores 0.100. Similar intent to nDCG@3 but with a steeper rank penalty.
Top-1 hit: Fraction of queries where the gold-relevant document is the single top result. The strictest metric and the one closest to a no-LLM lookup workflow.
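Because every query has exactly one gold document, all of these metrics reduce to simple functions of the gold document's rank. A minimal sketch of that arithmetic, assuming the 1-based gold rank per query (or None if it was not retrieved) is already known; the function and variable names are illustrative, not from the benchmark harness:

```python
import math

def ndcg_at_k(rank, k):
    """Single gold document: nDCG@k = 1 / log2(rank + 1) if rank <= k, else 0."""
    if rank is not None and rank <= k:
        return 1.0 / math.log2(rank + 1)
    return 0.0

def mrr_at_k(rank, k):
    """Reciprocal rank, cut off at k."""
    if rank is not None and rank <= k:
        return 1.0 / rank
    return 0.0

def recall_at_k(rank, k):
    """1 if the gold document appears in the top k, else 0."""
    return 1.0 if rank is not None and rank <= k else 0.0

# gold_ranks: 1-based rank of the gold document per query, None if missing (illustrative values)
gold_ranks = [1, 3, None, 2, 7]
n = len(gold_ranks)
print("nDCG@3   ", sum(ndcg_at_k(r, 3) for r in gold_ranks) / n)
print("MRR@10   ", sum(mrr_at_k(r, 10) for r in gold_ranks) / n)
print("Recall@10", sum(recall_at_k(r, 10) for r in gold_ranks) / n)
print("Top-1 hit", sum(recall_at_k(r, 1) for r in gold_ranks) / n)
```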
nDCG@3 results by domain
The AVG ranking hides domain inversions. Harrier wins CUAD but lands seventh on TechQA. SFR-2 ranks second on TechQA but only fourth on CUAD. KaLM-12B is fifth on MedRAG and ninth on TechQA. Per-domain nDCG@3:
BM25 is competitive on MedRAG (0.7862, beating PubMedBERT and the multilingual Granite) and weak on CUAD (0.5844, where 11 of 14 dense models outrank it). Legal contracts contain dense entity language that rewards lexical match. On medical abstracts, the top dense models (Nemotron 0.9629, SFR-2 0.9620, jina-v5 0.9523) outrank BM25 by 0.17 to 0.18 nDCG@3 absolute points.
Bootstrap 95% confidence intervals per (model, domain) cell, including a four-way MedRAG tie at the top and a Harrier-Nemotron CUAD overlap that the point-estimate ranking flattens, are reported in the benchmark methodology section.
Cost per million tokens
The self-hosted cost is GPU-amortized: the hourly rate divided by tokens processed per hour. The pod we used was a RunPod community-cloud H100 80GB SXM5 at $2.99/hr. Wall-clock time per model across the 551-query, 3-corpus pass (~46.2M tokens total) yields the following $/1M tokens estimates:
The formula: $/1M tokens = (GPU $/hr ÷ 3600) × wall_seconds × 1,000,000 ÷ total_tokens, where:
- GPU $/hr = $2.99 (the RunPod community H100 80GB SXM5 rate of the pod we used)
- wall_seconds = each model’s total wall-clock time across the 551-query, 3-corpus pass
- total_tokens ≈ 46.22M (sum of the 3 corpora plus the 551 queries, char-count ÷ 4 heuristic)
Worked example, Nemotron-8B: ($2.99 / 3600) × (1247.8 × 1,000,000 / 46,220,000) = $0.0224 per 1M tokens.
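The same arithmetic as a short sketch, using only the values stated above; nothing here is measured beyond what the text already gives:

```python
GPU_USD_PER_HOUR = 2.99      # RunPod community H100 80GB SXM5 rate
TOTAL_TOKENS = 46_220_000    # ~46.22M tokens across 3 corpora + 551 queries

def cost_per_million_tokens(wall_seconds: float) -> float:
    """GPU-amortized $ per 1M tokens: (hourly rate / 3600) * wall seconds, scaled to 1M tokens."""
    total_cost_usd = (GPU_USD_PER_HOUR / 3600.0) * wall_seconds
    return total_cost_usd * 1_000_000 / TOTAL_TOKENS

print(round(cost_per_million_tokens(1247.8), 4))  # Nemotron-8B -> ~0.0224
```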
Five models lead their cost tier (no other row both costs less and scores higher): Granite-278m-multilingual at the bottom of the cost ladder, then Granite-small-r2, EmbeddingGemma-300m, jina-v5-text-small, and Nemotron-8B at the top of the quality ladder. The endpoints span 13x in cost ($0.0017/M to $0.0224/M) and 0.23 nDCG@3 absolute (0.6952 to 0.9249).
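"Leads its cost tier" is a Pareto-efficiency check: a model stays on the frontier when no other row both costs less and scores higher. A hedged sketch of that filter; only the two endpoint rows stated above are real values, the rest must be filled in from the leaderboard:

```python
# (name, usd_per_1M_tokens, avg_ndcg_at_3) -- only the two stated endpoints are real values.
rows = [
    ("Granite-278m-multilingual", 0.0017, 0.6952),
    ("Nemotron-8B", 0.0224, 0.9249),
    # ... remaining models from the leaderboard ...
]

def pareto_front(rows):
    """Keep rows for which no other row both costs less and scores higher."""
    return [
        (name, cost, score)
        for name, cost, score in rows
        if not any(c < cost and s > score for _, c, s in rows)
    ]

for name, cost, score in sorted(pareto_front(rows), key=lambda r: r[1]):
    print(f"{name}: ${cost}/1M tokens, nDCG@3 {score}")
```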
Domain specialists vs generalists
PubMedBERT, fine-tuned on PubMed title-abstract pairs, is the obvious “right tool” for medical RAG retrieval on PubMed. It scores nDCG@3 = 0.7084 on MedRAG, which is below the BM25 lexical baseline (0.7862) on the same corpus. Modern open-source generalists outrank it by 0.22 to 0.25 absolute points on its training-data domain:
The reason the specialist underperforms is age and recipe. PubMedBERT is a 2022 110M-parameter BERT with symmetric mean pooling and no instruction prefix. The 2024-2026 generalists are built on bigger backbones, asymmetric query and document prefixes, and instruction-tuned retrieval objectives. The architectural gap matters more than the domain match: a 4-year-old fine-tune cannot keep up with a current-generation instruction-tuned retriever, even on the fine-tune’s own training corpus.
The buyer rule is to test a domain specialist against a modern generalist on representative queries before deploying it. The “specialist will win on its domain” assumption is no longer safe for open-source embedding models in 2026.
Findings from the open-source embedding benchmark
Nemotron-8B’s TechQA lead is statistically separated from second place
Nemotron-8B AVG nDCG@3 = 0.9249. Per-domain it lands at 0.8602 on CUAD, 0.9515 on TechQA, and 0.9629 on MedRAG. The TechQA result (0.9515, 95% CI [0.923, 0.977]) does not overlap with second-place SFR-Embedding-2_R (0.9109, [0.869, 0.949]): the bootstrap CIs separate cleanly. The 8B Llama-3.1 base, instruction-tuned for retrieval with a query-side Instruct: …\nQuery: … prefix and a symmetric document-side prefix, drives a 0.04 absolute nDCG@3 lead over the next row on long-document support workloads.
The two domains where Nemotron wins outright (TechQA, MedRAG) are the long-document corpora where instruction-prefix asymmetry matters most. CUAD is the one domain where it does not lead: Microsoft’s Harrier-oss-v1-0.6b (0.8720) outranks Nemotron (0.8602) on legal contracts despite being 13x smaller, though the CIs overlap and the lead is not statistically separated at this sample size.
A 0.6B Microsoft Harrier model outranks every open model under 7B parameters
Microsoft Harrier-oss-v1-0.6b (released 2026-04 with a Qwen3-0.6B base and an MIT license) lands at AVG nDCG@3 = 0.8911, fourth overall. It outranks the 12B Tencent KaLM-Gemma3 (0.8057, Tencent community license), the 7B Salesforce SFR-Embedding-2_R on CUAD (0.8421 vs Harrier 0.8720), and Google’s EmbeddingGemma-300m (0.8706). On a same-architecture comparison, Harrier-0.6b (0.8911) sits 0.074 nDCG@3 above Qwen3-Embedding-0.6B (0.8168), built on the identical Qwen3-0.6B base. The training corpus and instruction recipe drove the gap, not the parameter count.
For buyers, Harrier is the highest-ranked open-source row that ships with a license suitable for commercial use without restrictions. SFR-2 (CC-BY-NC), Nemotron (NSCL-v1), and jina-v5 (CC-BY-NC) outrank it on the AVG ladder, but all three are research-only or non-commercial.
A medical-specialist embedder loses to BM25
NeuML’s PubMedBERT-base-embeddings was fine-tuned on PubMed title-abstract pairs. It is the obvious “right tool” for a medical RAG benchmark on PubMed. It scores nDCG@3 = 0.7084 on MedRAG, which is 0.078 absolute below the BM25 lexical baseline (0.7862) on the same corpus. The top open-source generalists on MedRAG land far above both: Nemotron-8B 0.9629, SFR-Embedding-2_R 0.9620, Harrier-oss 0.9605, jina-v5 0.9523, KaLM-Gemma3-12B 0.9453.
This is the inversion that should change how a buyer picks a domain specialist. PubMedBERT is a 2022 110M-parameter BERT, with symmetric mean pooling and no instruction prefix. The 2024 to 2026 generalist field is built on bigger backbones, asymmetric query and document prefixes, and instruction-tuned retrieval objectives. On MedRAG queries that already include medical vocabulary, BM25’s lexical match is naturally strong, and PubMedBERT’s specialization adds nothing on top of it.
The practical conclusion is not to pick a specialist embedder by name alone. Benchmark it on your own queries before committing.
Snowflake Arctic swings 0.32 nDCG@3 across domains
Snowflake’s snowflake-arctic-embed-l-v2.0 (568M, Apache-2.0, bge-m3-retromae derivative, multilingual) scores nDCG@3 = 0.5846 on CUAD legal contracts and 0.9053 on MedRAG medical abstracts. The same model, same recipe, same query format, with a 0.32-point swing across two domains. Other models in the slate swing less: SFR-2 spans 0.8421 to 0.9620 (gap 0.12), Nemotron spans 0.8602 to 0.9629 (gap 0.10), Harrier spans 0.8408 to 0.9605 (gap 0.12).
The mechanism is training-data composition. Arctic was tuned on BEIR, MIRACL, and CLEF; legal contracts are not represented. For a vertical retrieval workload, domain training data matters more than parameter count or context length.
How open source embedding inference works
Open-source embedding models run on one of two backends in this benchmark: sentence-transformers and vLLM. The split is not about quality; it is about runtime efficiency on the largest models, where sentence-transformers’ default Python inference loop is too slow to be tractable.
Per-model recipe matters more than the choice of backend. Modern retrieval models use asymmetric prefixes: the query side is wrapped in an Instruct-style prompt (Instruct: Given a question, retrieve passages...\nQuery: <text>) while the document side is plain. Pooling type varies: BERT-derived models use CLS pooling; LLM-derived models (Llama, Mistral, Qwen3, Gemma3 base) use last-token pooling; multilingual models often use mean pooling. The HuggingFace card for each model is the source of truth for which prefix and pooling combination is correct.
Backend tier:
- vLLM: Nemotron-8B, KaLM-Gemma3-12B, jina-v5-text-small
- sentence-transformers: Qwen3-0.6B, EmbeddingGemma-300m, Granite trio, SFR-2, Conan-v1, PubMedBERT, GIST, Snowflake Arctic, Microsoft Harrier
Asymmetric prefix patterns observed:
- Instruct + Query/Document: SFR-2, KaLM-Gemma3, Nemotron-8B, Qwen3-Embedding
- Built-in encode_query / encode_document: EmbeddingGemma, KaLM-Gemma3, Nemotron-8B
- task / prompt_name (sentence-transformers parameter): jina-v5, Snowflake Arctic, Harrier
- No prefix (symmetric): Granite trio, Conan, PubMedBERT, GIST
Pooling type by base architecture:
- CLS pooling: Granite r2 trio, Snowflake Arctic
- Last-token pooling: Nemotron, KaLM-Gemma3, SFR-2, jina-v5, Qwen3-Embedding, Harrier
- Mean pooling: EmbeddingGemma, Granite-multilingual, Conan, PubMedBERT, GIST
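To make the recipe point concrete, here is a minimal sentence-transformers sketch of an asymmetric encode, assuming a model whose HuggingFace card defines a query prompt. The model id and prompt name are illustrative, and each model card remains the source of truth for the correct prefix and pooling:

```python
from sentence_transformers import SentenceTransformer

# Illustrative model id; substitute the model you are benchmarking.
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

queries = ["How is the termination clause defined?"]
documents = ["This Agreement may be terminated by either party upon 30 days notice..."]

# Query side: instruction-style prompt (asymmetric). Document side: plain text.
# Many cards expose this via prompt_name / prompt; others expect a literal
# "Instruct: ...\nQuery: ..." string prepended to the query text.
q_emb = model.encode(queries, prompt_name="query", normalize_embeddings=True)
d_emb = model.encode(documents, normalize_embeddings=True)

# Cosine similarity on L2-normalized vectors is a plain dot product.
scores = q_emb @ d_emb.T
print(scores)
```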
Using the wrong recipe silently degrades retrieval quality without crashing. Any benchmark of open-source embedders should include a sanity floor (Recall@10 below 0.5 across all domains for any model is a red flag for a misconfiguration, not a result).
Open source embedding models benchmark methodology
Three retrieval domains were evaluated: CUAD legal contracts (246 queries, 509 contracts), TechQA customer-support technotes (151 queries, 28,000 IBM technotes), and MedRAG-PubMed healthcare abstracts (154 queries, 50,000 abstracts). Total: 551 queries.
The dataset construction methodology is shared with our prior English embedding models benchmark:
- Protocol-A 3-LLM consensus query generation (rotating writer pool, fixed scorer, two non-writer validators per attempt)
- Corpus pinning by SHA-256 hash
- Per-domain entity-banned-token whitelists to prevent BM25 lexical shortcuts
- Cohen’s κ inter-rater agreement reported per validator pair
- BM25 baseline ranks synthesized from the bm25_rank_at_target field already present in each query JSON (Pyserini-equivalent)
Primary metric: nDCG@3 (RAG-realistic, what production RAG systems consume). Secondary metrics: nDCG@10, Recall@10, Recall@100, MRR@10, Top-1 hit.
Open-source-specific specs:
- GPU: 1 x NVIDIA H100 80GB SXM5 via RunPod community cloud
- Pod template: runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404
- Stack: PyTorch 2.10.0+cu128, vLLM 0.19.1, transformers 5.6.2, sentence-transformers 5.4.1
- Per-model dispatch: the HF model card is the primary path. sentence-transformers for most of the slate; vLLM for Nemotron-8B, KaLM-Gemma3-12B, and jina-v5-text-small.
- Per-model chunking: char-level truncation at max_seq_length × 4 chars per token, then the model’s tokenizer truncates to its actual max sequence length (see the sketch after this list).
- Asymmetric retrieval: every model that supports it gets the HF-card-documented query and document prefix. No prefix is the documented default for some.
- L2 normalization: applied uniformly post-pooling. Some models do this internally. We re-normalize to ensure parity across the slate.
- Embedding cache key: includes prefix + task + prompt_name + max_seq + backend, so a prefix swap mid-run cannot silently load stale embeddings.
- Statistical protocol: 10K bootstrap resamples per (model, domain, metric) cell, percentile 95% CI, seed=2026.
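The chunking bullet above relies on a ÷4 chars-per-token heuristic for both truncation and token counting. A minimal sketch of that step; the function names are illustrative, not taken from the harness:

```python
CHARS_PER_TOKEN = 4  # heuristic used for both chunking and token counting

def truncate_for_model(text: str, max_seq_length: int) -> str:
    """Cheap char-level pre-truncation; the model's tokenizer still applies
    its own max_seq_length truncation afterwards."""
    return text[: max_seq_length * CHARS_PER_TOKEN]

def estimate_tokens(text: str) -> int:
    """char-count / 4 heuristic behind the ~46.22M total-token estimate."""
    return len(text) // CHARS_PER_TOKEN

doc = "..." * 100_000  # stand-in for a long technote or abstract
print(len(truncate_for_model(doc, max_seq_length=512)))  # 2048 chars ~= first 512 tokens
```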
Models tested
Sorted by AVG nDCG@3 rank. Backend column: ST = sentence-transformers, vLLM = vLLM 0.19.
Bootstrap 95% confidence intervals results
The full leaderboard above is single-run per (model, domain) cell. Cross-session model-init variance is not measured. To capture within-run query-level variance, we resample the per-query rank vector for each (model, domain) cell 10,000 times with replacement (percentile method, seed=2026, sample sizes CUAD n=246, TechQA n=151, MedRAG n=154). Per-domain bootstrap 95% CI on nDCG@3:
The CIs do change which inversions the data supports. On CUAD, Harrier (0.8720, [0.836, 0.906]) and Nemotron (0.8602, [0.821, 0.897]) overlap, so the Harrier-on-CUAD lead is not clean-separating at this sample size. On TechQA, Nemotron (0.9515, [0.923, 0.977]) and SFR-2 (0.9109, [0.869, 0.949]) do not overlap, so Nemotron’s TechQA lead is statistically separated. On MedRAG, the top four (Nemotron 0.9629, SFR-2 0.9620, Harrier 0.9605, jina-v5 0.9523) are within each other’s CIs and form a four-way statistical tie. The PubMedBERT-below-BM25 inversion on MedRAG (0.7084 [0.641, 0.772] vs BM25 0.7862) is on the margin of overlap. The central tendency clearly puts the specialist below BM25, but a 3-run cross-session pass is needed to resolve it as separated rather than overlapping.
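A sketch of the percentile bootstrap behind these intervals, following the protocol stated in the methodology (10K resamples, percentile method, seed=2026). The function below is an illustrative reimplementation, not the benchmark code:

```python
import numpy as np

def bootstrap_ci(per_query_scores, n_resamples=10_000, alpha=0.05, seed=2026):
    """Percentile bootstrap CI on the mean per-query nDCG@3 for one (model, domain) cell."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_query_scores, dtype=float)
    # Resample query indices with replacement and take the mean of each resample.
    idx = rng.integers(0, len(scores), size=(n_resamples, len(scores)))
    means = scores[idx].mean(axis=1)
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), lo, hi

# e.g. 246 per-query nDCG@3 values for one model's CUAD cell:
# point, lo, hi = bootstrap_ci(cuad_ndcg3_per_query)
```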
Limitations
Single run per (model, domain) cell. The bootstrap CI table above captures within-run query-level variance (10K resamples, percentile method, seed=2026), but cross-session model-init variance is not measured. A 3-run cross-session pass is planned for v2.1. The closer ties surfaced by the CI table (e.g., the four-way MedRAG tie at the top, the Harrier-Nemotron CUAD overlap, the PubMedBERT-vs-BM25 marginal inversion) would benefit most from the multi-run pass.
Per-model context-length confound. Models with 512-token context windows (Granite-278m-multilingual, PubMedBERT, Conan, GIST) only see the first ~2K characters of each document. Models with 8K or 32K context (Nemotron, KaLM-12B, jina-v5, Harrier, Granite r2 english) see the full document. This favors long-context models on TechQA (long technotes) and MedRAG (long abstracts).
MedRAG training-data contamination risk. Several of the evaluated models were trained on PubMed-derived data (PubMedBERT by definition, possibly Granite-278m-multilingual, possibly Qwen3 base). Some MedRAG nDCG@3 boost may reflect training-data overlap rather than retrieval quality.
Conan-v1 is Chinese-trained. Including it on English-only domains is an instructive data point on language-mismatch rather than a fair head-to-head on English retrieval quality. We expect underperformance versus English-trained peers and that is what the data shows.
Conclusion
NVIDIA Llama-Embed-Nemotron-8B leads at AVG nDCG@3 = 0.9249 with statistically separated TechQA and MedRAG wins. The highest-ranked open-source pick under an unrestricted license (MIT) is Microsoft Harrier-oss-v1-0.6b at AVG 0.8911. Google EmbeddingGemma-300m runs at roughly 4x lower cost for a small accuracy hit.
Further reading
Explore other RAG benchmarks, such as:
- Top 10 Multilingual Embedding Models for RAG
- Embedding Models: OpenAI vs Gemini vs Voyage
- Top Vector Database for RAG: Qdrant vs Weaviate vs Pinecone
- Reranker Benchmark: Top 8 Models Compared
- Multimodal Embedding Models: Apple vs Meta vs OpenAI
- Hybrid RAG: Boosting RAG Accuracy
- Graph RAG vs Vector RAG