
Author: Nomic Team

Introducing Nomic Embed: A Truly Open Embedding Model

Nomic Embed Beats OpenAI

We're excited to announce the release of Nomic Embed, the first text embedding model with an 8192 context length that outperforms OpenAI Ada-002 and text-embedding-3-small on both short and long context tasks. We release the model weights and training code under an Apache-2 license, as well as the curated data we used to train the model. We also release a detailed technical report.

Nomic Embed is in general availability for production workloads through the Nomic Atlas Embedding API with 1M free tokens included and is enterprise-ready via our fully secure and compliant Nomic Atlas Enterprise offering.

Text embeddings are an integral component of modern NLP applications, powering retrieval-augmented generation (RAG) for LLMs and semantic search. They encode semantic information about sentences or documents into low-dimensional vectors that are then used in downstream applications, such as clustering for data visualization, classification, and information retrieval. Currently, the most popular long-context text embedding model is OpenAI's text-embedding-ada-002, which supports a context length of 8192. Unfortunately, Ada is closed source and its training data is not auditable.
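To make the retrieval use case concrete, here is a minimal sketch of semantic search over embeddings using cosine similarity. The three-dimensional vectors are made-up toy values rather than outputs of any real model, and the helper name rank_by_cosine is hypothetical.

```python
import numpy as np

def rank_by_cosine(query, docs):
    """Return document indices sorted from most to least similar, plus the scores."""
    query = np.asarray(query, dtype=float)
    docs = np.asarray(docs, dtype=float)
    # Normalize to unit length so the dot product equals cosine similarity.
    query = query / np.linalg.norm(query)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = docs @ query
    # argsort is ascending, so reverse it for best-first order.
    return np.argsort(scores)[::-1], scores

# Toy embeddings (assumed values, for illustration only).
doc_vectors = [
    [0.9, 0.1, 0.0],  # document 0: points almost the same way as the query
    [0.0, 1.0, 0.0],  # document 1: orthogonal to the query
    [0.7, 0.7, 0.1],  # document 2: partially aligned with the query
]
query_vector = [1.0, 0.0, 0.0]

order, scores = rank_by_cosine(query_vector, doc_vectors)
print(order.tolist())
```

In a real application the vectors would come from an embedding model, and the same ranking step would run over thousands or millions of documents, usually through a vector index rather than a dense matrix product.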

Top-performing open source long-context text embedding models such as E5-Mistral and jina-embeddings-v2-base-en are either not practical for general-purpose use due to model size or fail to exceed the performance of their OpenAI counterparts.

Nomic-embed changes that.

How Are Text Encoders Trained?

Text encoders are usually trained with contrastive learning on large collections of paired texts in multiple stages.

At a high level, the Transformer architecture is first pre-trained with a self-supervised masked language modeling (MLM) objective, as in BERT, then contrastively trained on web-scale unsupervised data, and finally contrastively finetuned on a smaller, curated corpus of paired data.

The first unsupervised contrastive stage trains on a dataset generated from weakly related text pairs, such as question-answer pairs from forums like StackExchange and Quora, title-body pairs from Amazon reviews, and summarizations from news articles.

In the second finetuning stage, higher quality labeled datasets such as search queries and answers from web searches are leveraged. Data curation and hard-example mining are crucial in this stage.
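As an illustration of what contrastive training on paired texts optimizes, the sketch below computes an InfoNCE-style loss with in-batch negatives in NumPy. This post does not specify the exact loss used for Nomic Embed, so treat the loss form, the temperature value, and the function name info_nce_loss as assumptions chosen for illustration.

```python
import numpy as np

def info_nce_loss(queries, documents, temperature=0.05):
    """Mean InfoNCE loss where row i of `queries` is paired with row i of `documents`."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = documents / np.linalg.norm(documents, axis=1, keepdims=True)
    # logits[i, j] = scaled cosine similarity between query i and document j.
    logits = (q @ d.T) / temperature
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The matching pair sits on the diagonal; every other column acts as a negative.
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
aligned_docs = queries + 0.01 * rng.normal(size=(4, 8))  # near-copies of the queries
shuffled_docs = aligned_docs[[1, 2, 3, 0]]               # deliberately mismatched pairs

aligned_loss = info_nce_loss(queries, aligned_docs)
shuffled_loss = info_nce_loss(queries, shuffled_docs)
print(aligned_loss, shuffled_loss)
```

The loss is low when each query is closest to its own paired document and high when the pairs are mismatched. Hard-example mining matters because negatives that are almost as similar as the true pair contribute the most to this loss.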

How We Built Nomic Embed

In this blog post, we outline the high-level recipe for building nomic-embed. For further details, please see our technical report.

Training a 2048 Context-Length BERT

To train nomic-embed, we followed a multi-stage contrastive learning pipeline. We start our model from a BERT initialization. Since bert-base only handles context lengths up to 512 tokens, we train our own 2048 context length BERT, nomic-bert-2048.

We make several modifications to our BERT training procedure, inspired by MosaicBERT, and implement a number of training optimizations. The technical report describes them in detail.

We evaluate the quality of nomic-bert-2048 on the standard GLUE benchmark. We find it performs comparably to other BERT models but with the advantage of a significantly longer context length.

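| Model | Bsz | Steps | Seq | Avg | Cola | SST2 | MRPC | STSB | QQP | MNLI | QNLI | RTE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NomicBERT | 4k | 100k | 2048 | 0.84 | 0.50 | 0.93 | 0.88 | 0.90 | 0.92 | 0.86 | 0.92 | 0.82 |
| RobertaBase | 8k | 500k | 512 | 0.86 | 0.64 | 0.95 | 0.90 | 0.91 | 0.92 | 0.88 | 0.93 | 0.79 |
| JinaBERTBase | 4k | 100k | 512 | 0.83 | 0.51 | 0.95 | 0.88 | 0.90 | 0.81 | 0.86 | 0.92 | 0.79 |
| MosaicBERT | 4k | 178k | 128 | 0.85 | 0.59 | 0.94 | 0.89 | 0.90 | 0.92 | 0.86 | 0.91 | 0.83 |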
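The Avg column looks like the unweighted mean of the eight task scores. The quick check below recomputes it from the numbers in the table. This is only a reading of the table under that assumption, not a description of how the scores were actually aggregated.

```python
# (reported Avg, [Cola, SST2, MRPC, STSB, QQP, MNLI, QNLI, RTE]) copied from the table.
glue = {
    "NomicBERT":    (0.84, [0.50, 0.93, 0.88, 0.90, 0.92, 0.86, 0.92, 0.82]),
    "RobertaBase":  (0.86, [0.64, 0.95, 0.90, 0.91, 0.92, 0.88, 0.93, 0.79]),
    "JinaBERTBase": (0.83, [0.51, 0.95, 0.88, 0.90, 0.81, 0.86, 0.92, 0.79]),
    "MosaicBERT":   (0.85, [0.59, 0.94, 0.89, 0.90, 0.92, 0.86, 0.91, 0.83]),
}

for model, (reported, tasks) in glue.items():
    mean = sum(tasks) / len(tasks)
    print(f"{model:13s} reported={reported:.2f} recomputed={mean:.4f}")

# Largest gap between the reported and recomputed averages across all models.
max_gap = max(abs(sum(tasks) / len(tasks) - reported) for reported, tasks in glue.values())
print(f"largest gap: {max_gap:.4f}")
```

The recomputed means agree with the reported values to within rounding of the two-decimal task scores.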

Contrastive Training of Nomic Embed

We initialize the training of nomic-embed with nomic-bert-2048. Our contrastive dataset is composed of ~235M text pairs. We extensively validated its quality during collection with Nomic Atlas. You can find dataset details in the nomic-ai/contrastors codebase as well as explore a 5M pair subset in Nomic Atlas.

On the Massive Text Embedding Benchmark (MTEB), nomic-embed outperforms text-embedding-ada-002 and jina-embeddings-v2-base-en.

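| Name | SeqLen | MTEB | LoCo | Jina Long Context | Open Weights | Open Training Code | Open Data |
|---|---|---|---|---|---|---|---|
| nomic-embed | 8192 | 62.39 | 85.53 | 54.16 | | | |
| jina-embeddings-v2-base-en | 8192 | 60.39 | 85.45 | 51.90 | | | |
| text-embedding-3-small | 8191 | 62.26 | 82.40 | 58.20 | | | |
| text-embedding-ada-002 | 8191 | 60.99 | 52.7 | 55.25 | | | |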

Unfortunately, MTEB doesn't evaluate models on long-context tasks. Therefore, we additionally evaluated nomic-embed on the recently released LoCo Benchmark as well as the Jina Long Context Benchmark.

For the LoCo Benchmark, we split evaluations by parameter class and by whether the evaluation is performed in a supervised or unsupervised setting. We bold the top performing model in each split. Nomic Embed is the best performing 100M parameter class unsupervised model. Notably, Nomic Embed is competitive both with the top performing models in the 7B parameter class and with models trained in a supervised setting specifically for the LoCo benchmark:

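| Model | Seq | Param. | Avg | Tau Scrolls Summ. | Tau Scrolls Gov. | Tau Scrolls QMSUM | QASPER - Title to Article | QASPER - Abstract to Article |
|---|---|---|---|---|---|---|---|---|
| Unsupervised 100M | | | | | | | | |
| Jinav2 | 8192 | 137M | 85.5 | 93.3 | 98.6 | 40.8 | 95.1 | 99.3 |
| nomic-embed | 8192 | 137M | 85.53 | 90.9 | 97.73 | 97.78 | 94.87 | 99.9 |
| text-embedding-ada-002 | 8192 | N/A | 52.7 | 37.3 | 44.3 | 7.30 | 85.1 | 89.7 |
| text-embedding-3-small | 8192 | N/A | 82.41 | 92.16 | 97.65 | 27.42 | 95.9 | 98.8 |
| Unsupervised 7B | | | | | | | | |
| E5 Mistral | 4096 | 7B | 87.8 | 95.9 | 98.3 | 46.8 | 98.4 | 99.8 |
| Supervised | | | | | | | | |
| M2-Bert | 2048 | 80M | 83.6 | 81.8 | 94.7 | 58.5 | 87.3 | 95.5 |
| M2-Bert | 8192 | 80M | 87.9 | 94.7 | 96.5 | 64.1 | 86.8 | 97.5 |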

Nomic Embed also outperforms jina-embeddings-v2-base-en in aggregate on the Jina Long Context Benchmark. Unfortunately, Nomic Embed does not outperform OpenAI ada-002 or text-embedding-3-small on this benchmark:

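| Model | Seq | NarrativeQA | WikiCities | SciFact | BigPatent | Avg. |
|---|---|---|---|---|---|---|
| nomic-embed | 8192 | 37.78 | 84.26 | 70.17 | 24.46 | 54.16 |
| Jinav2 | 8192 | 39.4 | 75.7 | 69.4 | 23.1 | 51.9 |
| text-embedding-ada-002 | 8192 | 41.1 | 84.7 | 72.7 | 22.5 | 55.3 |
| text-embedding-3-small | 8192 | 47.12 | 89.9 | 73.32 | 22.48 | 58.2 |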

Overall, Nomic Embed outperforms OpenAI Ada-002 and text-embedding-3-small on two of the three benchmarks.
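The tally can be reproduced from the aggregate scores reported in the summary table earlier in this post. The short script below counts the benchmarks on which nomic-embed scores higher than both OpenAI models. The dictionary layout and variable names are ours, and the numbers are copied from that table.

```python
# Aggregate scores copied from the summary table (MTEB, LoCo, Jina Long Context).
scores = {
    "nomic-embed":            {"MTEB": 62.39, "LoCo": 85.53, "Jina Long Context": 54.16},
    "text-embedding-3-small": {"MTEB": 62.26, "LoCo": 82.40, "Jina Long Context": 58.20},
    "text-embedding-ada-002": {"MTEB": 60.99, "LoCo": 52.7,  "Jina Long Context": 55.25},
}
openai_models = ["text-embedding-3-small", "text-embedding-ada-002"]

# A benchmark counts as a win only if nomic-embed beats both OpenAI models on it.
wins = [
    benchmark
    for benchmark, value in scores["nomic-embed"].items()
    if all(value > scores[model][benchmark] for model in openai_models)
]
print(f"{len(wins)} of {len(scores['nomic-embed'])} benchmarks: {wins}")
```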

Towards Improved Evaluation of Embedding Models

As models become increasingly complex and benchmarks become increasingly saturated, we will need new paradigms for evaluating our models. We believe that the direct comparison of model embedding spaces can reveal model characteristics that are not captured by benchmarks. As an exercise, we compared the embedding spaces of Nomic Embed and OpenAI Ada on a 250K sample of English Wikipedia. Toggling between the two spaces reveals structural differences between the representations built by the two models. The starkest difference in the map below is in each model's treatment of disambiguation pages. Nomic Embed clusters these articles together in an island on the bottom right of the map, while OpenAI Ada sorts them into several regions of the space. To see this, search "may refer to" by clicking the magnifying glass, and then toggle between the embedding spaces.

The models' different treatment of disambiguation articles is a systematic difference in their behaviors that is not captured by the benchmarks. This underscores the importance of developing new evaluation paradigms that can fully capture the differences between models.
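One hypothetical way to put a number on such differences is to measure how many nearest neighbors each item shares across two embedding spaces. The sketch below is our own illustration, not the method behind the map above, and it uses random toy data in place of real model embeddings. The function names knn_indices and neighbor_overlap are assumptions.

```python
import numpy as np

def knn_indices(embeddings, k):
    """Indices of the k nearest neighbors of each row under cosine similarity."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = x @ x.T
    np.fill_diagonal(sims, -np.inf)  # a point is not its own neighbor
    return np.argsort(-sims, axis=1)[:, :k]

def neighbor_overlap(space_a, space_b, k=5):
    """Mean fraction of k-nearest neighbors that each item shares across two spaces."""
    nn_a = knn_indices(space_a, k)
    nn_b = knn_indices(space_b, k)
    shared = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(shared))

rng = np.random.default_rng(0)
space_a = rng.normal(size=(50, 16))
# An orthogonal rotation preserves cosine similarity, so neighborhoods are unchanged.
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
space_rotated = space_a @ rotation
# An unrelated space, deliberately with a different dimensionality.
space_random = rng.normal(size=(50, 24))

same_structure = neighbor_overlap(space_a, space_rotated)
different_structure = neighbor_overlap(space_a, space_random)
print(same_structure, different_structure)
```

Because the measure only compares neighbor sets, it works even when the two models produce vectors of different dimensionality. Items with low overlap, such as the disambiguation pages discussed above, would be natural candidates for closer inspection.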

Nomic Embedding API and Atlas Enterprise

We release the Nomic Embed model weights and full training data for complete model auditability. Nomic recognizes that enterprises require fully auditable AI, and we're proud to offer the first performant text embedding model that can achieve it. Contact Nomic to learn about Nomic Atlas Enterprise.

The best way to use Nomic Embed is through our production-ready Nomic Embedding API.

You can access the API via HTTP and your Nomic API Key:

curl https://api-atlas.nomic.ai/v1/embedding/text \
    -H "Authorization: Bearer $NOMIC_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{ "model": "nomic-embed-text-v1",
          "texts": ["Nomic AI introduces Nomic Embed", "#keepAIOpen"]}'

You can also use the official Nomic Python client after you pip install nomic:

from nomic import embed
import numpy as np

output = embed.text(
    texts=[
        "Who is Laurens van der Maaten?",
        "What is dimensionality reduction?",
    ],
    model='nomic-embed-text-v1',
)

print(output['usage'])

embeddings = np.array(output['embeddings'])

print(embeddings.shape)
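If you would rather not install the client, the curl call shown earlier can be mirrored with the Python standard library. The sketch below only builds the request object and does not send it. The URL, headers, and JSON fields are taken from the curl example, while the helper name build_embedding_request and the placeholder key are assumptions.

```python
import json
import os
import urllib.request

API_URL = "https://api-atlas.nomic.ai/v1/embedding/text"

def build_embedding_request(texts, api_key, model="nomic-embed-text-v1"):
    """Build (but do not send) a POST request equivalent to the curl example."""
    body = json.dumps({"model": model, "texts": texts}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

request = build_embedding_request(
    ["Nomic AI introduces Nomic Embed", "#keepAIOpen"],
    # Falls back to a placeholder so the sketch runs without a real key.
    api_key=os.environ.get("NOMIC_API_KEY", "YOUR_API_KEY"),
)

# To actually call the API (requires a valid key and network access):
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Keeping request construction separate from sending makes it easy to inspect or log the payload before spending tokens.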

Data Access

To access the full data, we provide Cloudflare R2 access keys to the buckets containing the data. To get access, create a Nomic Atlas account and follow the instructions in the contrastors repo.

Nomic asks that if you want to use a public inference service for accessing Nomic Embed, you choose the Atlas Embedding API. This allows Nomic to continue driving future open-source AI innovation. Remember, you can always access and run the model without usage restrictions by simply downloading the open-source model weights.

“Henceforth, it is the map that precedes the territory” – Jean Baudrillard