
Author: Nomic Team

The Nomic Embedding Ecosystem

Over the past several months, Nomic has released a number of state-of-the-art (SOTA) open source embedding models. With so many models, it can be hard to know which one to use for your application. In this post, we explore the ecosystem of Nomic Embed models and how they can be used for a wide range of applications. We describe the benefits and features of each model, and provide benchmark comparisons against the relevant peer models.

The Ecosystem at a Glance

Nomic Embed v2

A state of the art multilingual embedder with a mixture-of-experts architecture.

Key Features:

  • SOTA performance on the MIRACL benchmark
  • Support for 100+ languages
  • 305M active parameters for efficient inference
  • Open weights, data, and code

Nomic Embed Text v1.5

The most popular open source text embedder on Hugging Face.

Key Features:

  • Outperforms OpenAI Embeddings on the MTEB benchmark
  • Matryoshka and Binary embeddings for efficient storage
  • 137M parameters for efficient inference
  • Open weights, data, and code

Nomic Embed Vision v1.5

A vision embedder aligned to the Nomic Embed Text v1.5 latent space.

Key Features:

  • Multimodal extension to Nomic Embed Text v1.5
  • Strong performance across text, image, and mixed modality search in a unified latent space
  • Open weights, data, and code

Nomic Embed Code

A state of the art code embedding model.

Key Features:

  • SOTA performance on the CodeSearchNet benchmark
  • Supports Python, JavaScript, Java, Go, PHP, and Ruby
  • Pair with Nomic CodeRankEmbed-137M for efficient inference
  • Open weights, data, and code

Nomic Embed v2

Nomic Embed v2 is a state of the art multilingual embedder that supports over a hundred languages. Nomic Embed v2 is the first general purpose embedder to utilize a mixture-of-experts architecture, which enables highly efficient inference by activating only a small subset of the model's parameters at inference time. This enables it to outperform other general purpose embedders of its size on the multilingual MMTEB benchmark, as shown below:

| Model | Avg | Bitext Mining | Class. | Clust. | Pair Class. | Reranking | Retrieval | STS | $/1M Tokens |
|---|---|---|---|---|---|---|---|---|---|
| Nomic Embed v2 | 62.48 | 65.12 | 60.37 | 45.59 | 76.33 | 61.72 | 57.26 | 71.03 | $0.01 |
| Voyage 3 Lite | 60.88 | 60.12 | 57.93 | 45.69 | 75.07 | 60.29 | 58.92 | 68.20 | $0.02 |
| OpenAI Embed 3 Small | 58.64 | 50.32 | 55.16 | 46.78 | 76.64 | 59.94 | 52.24 | 69.42 | $0.02 |
| Arctic Embed M 2.0 | 58.44 | 53.73 | 54.38 | 43.02 | 74.86 | 61.67 | 54.83 | 66.60 | $0.01 |

Note that we use $/1M Tokens as a proxy for model size, as some models do not report their size publicly. Further, we also only report publicly available metrics from the MMTEB leaderboard, as some private models do not report all MMTEB subscores.
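The efficiency gain from mixture-of-experts comes from routing each token through only a few expert feed-forward networks rather than all of them, so the "active" parameter count stays far below the total. A minimal numpy sketch of top-k expert routing; the dimensions, router, and expert count here are toy stand-ins, not Nomic Embed v2's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

n_experts, top_k = 8, 2          # toy values, not the model's real config
d_model, d_hidden = 16, 32

# One expert = one small feed-forward network.
W_in = rng.normal(size=(n_experts, d_model, d_hidden))
W_out = rng.normal(size=(n_experts, d_hidden, d_model))
router = rng.normal(size=(d_model, n_experts))

def moe_layer(x):
    """Route a single token vector through its top-k experts only."""
    logits = x @ router
    top = np.argsort(logits)[-top_k:]                 # chosen expert indices
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()
    # Only top_k of n_experts matmuls actually run here.
    return sum(g * (np.maximum(x @ W_in[e], 0) @ W_out[e])
               for g, e in zip(gates, top))

token = rng.normal(size=d_model)
y = moe_layer(token)
print(y.shape)
```

Only 2 of the 8 expert networks execute per token, which is the mechanism behind the "305M active parameters" figure cited above.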

Nomic Embed v2 also achieves SOTA performance on the MIRACL benchmark, outperforming much larger models including Voyage-3-Large and OpenAI Text Embedding 3 Large.

| Model | Avg | ar | bn | de | en | es | fa | fi | fr | hi | id | ja | ko | ru | sw | te | th | yo | zh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Nomic Embed v2 | 66.0 | 76.7 | 73.6 | 56.6 | 54.7 | 56.3 | 59.2 | 77.1 | 55.8 | 60.5 | 54.2 | 67.0 | 65.9 | 65.2 | 66.3 | 82.6 | 78.3 | 78.3 | 59.5 |
| Voyage-3-Large | 59.5 | 69.6 | 68.3 | 46.2 | 48.4 | 43.8 | 51.1 | 70.8 | 39.8 | 54.8 | 47.2 | 62.3 | 63.9 | 57.8 | 67.9 | 76.7 | 74.5 | 75.6 | 52.1 |
| OpenAI Text Embedding 3 Large | 54.9 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |

Note that OpenAI does not report their MIRACL subscores, so we only report their average score.

Nomic Embed v2's size makes it ideal for retrieve-rerank workflows, where a small model surfaces a candidate pool which is reordered by a larger, more expensive model.
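The retrieve-rerank pattern described above can be sketched with cosine similarity over precomputed embeddings. The vectors below are random stand-ins; in practice the cheap embeddings would come from a small model like Nomic Embed v2, and the expensive scorer would be a larger embedder or cross-encoder:

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, d_small = 10_000, 256

# Stage 1: cheap embeddings from the small model (random stand-ins here).
doc_vecs = rng.normal(size=(n_docs, d_small))
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
query = rng.normal(size=d_small)
query /= np.linalg.norm(query)

# Retrieve: score the whole corpus cheaply, keep a small candidate pool.
pool_size = 100
scores = doc_vecs @ query
pool = np.argsort(scores)[-pool_size:][::-1]

# Stage 2: rerank only the pool with an expensive scorer (stubbed here).
def expensive_score(doc_id):
    return scores[doc_id]  # stand-in for a large model / cross-encoder

reranked = sorted(pool, key=expensive_score, reverse=True)
print(reranked[:5])
```

The expensive model sees only 100 candidates instead of 10,000 documents, which is why pairing a small retriever with a large reranker keeps costs low.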

For a full breakdown of the model, see the Nomic Embed v2 paper or blog post. To use the model, see the Nomic Embed v2 Hugging Face model page.

Nomic Embed Code

Nomic Embed Code is a 7B parameter code embedding model optimized for code search that achieves state-of-the-art performance on the CodeSearchNet (CSN) benchmark. The Nomic Embed Code ecosystem also includes Nomic CodeRankEmbed-137M, a highly efficient code embedding model that achieves SOTA performance on the CSN benchmark for its size. Nomic Embed Code can be used as a standalone model, or paired with Nomic CodeRankEmbed-137M in a retrieve-rerank workflow.

| Model | Python | Java | Ruby | PHP | JavaScript | Go |
|---|---|---|---|---|---|---|
| Nomic Embed Code | 81.6 | 80.5 | 81.9 | 72.3 | 77.1 | 93.8 |
| Voyage Code 3 | 80.9 | 80.5 | 84.6 | 71.7 | 79.2 | 93.2 |
| Nomic CodeRankEmbed-137M | 78.4 | 76.9 | 79.3 | 68.8 | 71.4 | 92.7 |
| OpenAI Embed 3 Large | 70.8 | 72.9 | 75.3 | 59.6 | 68.1 | 87.6 |
| CodeSage Large v2 | 74.2 | 72.3 | 76.7 | 65.2 | 72.5 | 84.6 |

Nomic Embed Text v1.5

As of this writing, Nomic Embed Text v1.5 is the most popular open source embedder on Hugging Face, with over 35 million downloads.

It's easy to see why: Nomic Embed Text v1.5 outperforms the industry-standard OpenAI embeddings on the MTEB benchmark, and at only 137M parameters it scales easily to massive text collections. Its compact size also makes it ideal for locally running applications: a full-precision Nomic Embed Text v1.5 exceeds 100 queries per second on a standard M2 MacBook. Nomic Embed Text v1.5 also supports two efficient storage formats: Matryoshka and binary embeddings.
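Matryoshka embeddings let you keep only the leading dimensions of a full-length vector and renormalize, trading a little accuracy for large storage savings. A minimal sketch of the general recipe with stand-in vectors (the 768-to-256 truncation is an illustrative choice, and the model card documents the exact resizing steps for Nomic Embed Text v1.5):

```python
import numpy as np

def truncate_matryoshka(emb, dim):
    """Keep the first `dim` dimensions and re-normalize to unit length."""
    small = emb[..., :dim]
    return small / np.linalg.norm(small, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
full = rng.normal(size=(4, 768))                # 4 full-size embeddings
full /= np.linalg.norm(full, axis=1, keepdims=True)

small = truncate_matryoshka(full, 256)          # 3x smaller vectors
print(small.shape)                              # (4, 256)
```

Because Matryoshka-trained models front-load the most informative dimensions, the truncated vectors remain usable for cosine-similarity search at a third of the storage cost.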

When combined, Nomic Embed's local inference and binary storage capabilities enable a powerful workflow we call Retrieve-Rerank with Local Embedding Models, which reduces vector storage costs by up to 100x with virtually no loss in downstream performance. We illustrate this binary retrieve-rerank in the figure below:

Binary Retrieve Rerank

Binary Retrieve Rerank Performance

In the chart above, average precision measures how closely the documents surfaced by retrieve-rerank match those of a full-precision retrieval. The chart shows that the binary retrieve-rerank workflow performs virtually identically to full-precision retrieval.
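The binary retrieve-rerank workflow can be sketched directly: quantize each float embedding to one bit per dimension (a 32x reduction versus float32), retrieve candidates by Hamming distance over the packed bits, then rerank that small pool with the full-precision vectors. Random vectors stand in for real embeddings here:

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, dim, pool_size = 5_000, 256, 50

docs = rng.normal(size=(n_docs, dim))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = rng.normal(size=dim)
query /= np.linalg.norm(query)

# 1 bit per dimension: sign quantization, packed 8 dims per byte.
doc_bits = np.packbits(docs > 0, axis=1)      # (n_docs, dim // 8) uint8
query_bits = np.packbits(query > 0)

# Retrieve: Hamming distance via XOR + popcount over packed bytes.
xor = np.bitwise_xor(doc_bits, query_bits)
hamming = np.unpackbits(xor, axis=1).sum(axis=1)
pool = np.argsort(hamming)[:pool_size]

# Rerank the candidate pool with the full-precision embeddings.
reranked = pool[np.argsort(docs[pool] @ query)[::-1]]
print(reranked[:5])
```

Only the binary codes need to live in the hot index; the full-precision vectors are touched for just the 50 candidates, which is where the up-to-100x storage savings come from.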

Moreover, Nomic Embed Text v1.5 can be paired with other powerful rerankers to achieve strong retrieval performance at a fraction of the cost. For example, Nomic Embed Text v1.5 paired with Voyage-3-Large achieves SOTA performance on the MicroBEIR benchmark at a fraction of the cost of other methods.

MicroBEIR

For a full breakdown of the model, see the Nomic Embed Text paper or blog post. To use the model, see the Nomic Embed v1.5 Hugging Face model page.

Nomic Embed Vision v1.5

Nomic Embed Architecture

Nomic Embed Vision v1.5 is a vision embedder that is aligned to the Nomic Embed Text v1.5 latent space. This enables the shared Nomic Embed v1.5 latent space to achieve strong performance across text, image, and mixed-modality search. This stands in contrast to most CLIP-style models, which sacrifice performance on text search to achieve strong performance on image search.
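Because image and text embeddings share one latent space, a single index can hold both modalities, and one text query retrieves images and documents with the same cosine-similarity search. A sketch with stand-in vectors (in practice these would come from Nomic Embed Vision v1.5 and Nomic Embed Text v1.5; the 768 dimension is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 768  # shared embedding dimension (illustrative)

# Stand-ins for image and text embeddings living in the same space.
image_embs = rng.normal(size=(100, dim))
text_embs = rng.normal(size=(100, dim))
corpus = np.vstack([image_embs, text_embs])
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
modality = ["image"] * 100 + ["text"] * 100

query = rng.normal(size=dim)
query /= np.linalg.norm(query)

# One cosine-similarity search over the mixed-modality index.
top = np.argsort(corpus @ query)[-5:][::-1]
for i in top:
    print(i, modality[i])
```

With a CLIP-style model whose text tower is weak, you would typically need a second text-only index; alignment to a strong text embedder is what makes the single mixed index viable.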

For a full breakdown of the model, see the Nomic Embed Vision v1.5 paper or blog post. To use the model, see the Nomic Embed v1.5 Hugging Face model page.

Conclusion

In this post, we've explored the Nomic embedding ecosystem - a collection of truly open source state-of-the-art models for text, vision, and code embedding applications. From the powerful multilingual Nomic Embed v2, to the widely-adopted Nomic Embed Text v1.5, to specialized models like Nomic Embed Vision and Nomic Embed Code, each model is designed to excel at specific tasks while remaining efficient and accessible. These models demonstrate that it's possible to achieve industry-leading performance while maintaining open access to weights, data, and code.
