Nomic Blog: Introducing Nomic Embed: A Truly Open Embedding Model

Author: Nomic Team

We're excited to announce the release of Nomic Embed, the first open source, open data, open training code, fully reproducible and auditable text embedding model with an 8192 context length that outperforms OpenAI Ada-002 and text-embedding-3-small on both short and long context tasks. We release the model weights and training code under an Apache 2.0 license, as well as the curated data we used to train the model. We also release a detailed technical report.

Nomic Embed is in general availability for production workloads through the Nomic Atlas Embedding API with 1M free tokens included and is enterprise-ready via our fully secure and compliant Nomic Atlas Enterprise offering.

Text embeddings are an integral component of modern NLP applications, powering retrieval-augmented generation (RAG) for LLMs and semantic search. They encode semantic information about sentences or documents into low-dimensional vectors that are then used in downstream applications, such as clustering for data visualization, classification, and information retrieval. Currently, the most popular long-context text embedding model is OpenAI's text-embedding-ada-002, which supports a context length of 8192. Unfortunately, Ada is closed source and its training data is not auditable.
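To make the retrieval use case concrete, here is a minimal sketch of semantic search over embedding vectors using cosine similarity. The three-dimensional vectors are toy stand-ins for real model outputs, and the helper names cosine_similarity_matrix and top_k are illustrative rather than part of any Nomic library.

```python
import numpy as np


def cosine_similarity_matrix(queries: np.ndarray, documents: np.ndarray) -> np.ndarray:
    """Return a (num_queries, num_documents) matrix of cosine similarities."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = documents / np.linalg.norm(documents, axis=1, keepdims=True)
    return q @ d.T


def top_k(queries: np.ndarray, documents: np.ndarray, k: int = 1) -> np.ndarray:
    """Return the indices of the k most similar documents for each query."""
    sims = cosine_similarity_matrix(queries, documents)
    return np.argsort(-sims, axis=1)[:, :k]


# Toy 3-dimensional vectors standing in for real embeddings (assumption).
documents = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
queries = np.array([[0.9, 0.1, 0.0]])

ranking = top_k(queries, documents, k=2)
print(ranking)  # document 0 ranks first, then document 2
```

In a real pipeline the document matrix would hold one embedding per indexed text, and the query matrix would hold embeddings of incoming search queries.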

Top performing open source long-context text embedding models such as E5-Mistral and jina-embeddings-v2-base-en are either not practical for general-purpose use due to model size or fail to exceed the performance of their OpenAI counterparts.

Nomic-embed changes that.

How Are Text Encoders Trained?

Text encoders are usually trained with contrastive learning on large collections of paired texts in multiple stages.

At a high level, the Transformer architecture is first pre-trained with a self-supervised MLM objective (BERT), then contrastively trained with web-scale unsupervised data, and finally contrastively finetuned with a smaller, curated corpus of paired data.
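The contrastive stages can be illustrated with a small loss function. This post does not name the exact objective, so the sketch below assumes InfoNCE with in-batch negatives, a common choice for this kind of training: row i of the query batch is paired with row i of the document batch, and every other row in the batch serves as a negative. The temperature of 0.05 is an arbitrary illustrative value.

```python
import numpy as np


def info_nce_loss(query_emb: np.ndarray, doc_emb: np.ndarray, temperature: float = 0.05) -> float:
    """InfoNCE with in-batch negatives (assumed objective, for illustration).

    Row i of query_emb is the positive partner of row i of doc_emb.
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature                      # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct "class" for query i is document i, i.e. the diagonal.
    return float(-np.mean(np.diag(log_probs)))


rng = np.random.default_rng(0)
queries = rng.normal(size=(8, 16))
documents = queries + 0.01 * rng.normal(size=(8, 16))  # near-duplicates as positives

print("aligned pairs: ", info_nce_loss(queries, documents))
print("shuffled pairs:", info_nce_loss(queries, np.roll(documents, 1, axis=0)))
```

The loss is low when each query is closest to its own paired document and high when the pairing is scrambled, which is the signal that pulls paired texts together in embedding space.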

The first unsupervised contrastive stage trains on a dataset generated from weakly related text pairs, such as question-answer pairs from forums like StackExchange and Quora, title-body pairs from Amazon reviews, and summarizations from news articles.

In the second finetuning stage, higher quality labeled datasets such as search queries and answers from web searches are leveraged. Data curation and hard-example mining are crucial in this stage.
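One simple way to picture hard-example mining: for each query, pick the documents that an existing embedding model scores as most similar but that are not the labeled positive. The sketch below is a generic illustration under that assumption and is not the mining procedure used for Nomic Embed; the function name mine_hard_negatives and the toy vectors are hypothetical.

```python
import numpy as np


def mine_hard_negatives(query_emb, doc_emb, positive_idx, num_negatives=2):
    """Return, per query, the indices of the most similar non-positive documents."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sims = q @ d.T
    # Exclude each query's labeled positive from the candidate pool.
    sims[np.arange(len(q)), np.asarray(positive_idx)] = -np.inf
    return np.argsort(-sims, axis=1)[:, :num_negatives]


# Toy 2-dimensional embeddings (assumption): document 0 is the labeled positive.
docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
query = np.array([[1.0, 0.0]])

hard = mine_hard_negatives(query, docs, positive_idx=[0], num_negatives=2)
print(hard)  # document 1 is the hardest negative, then document 2
```

Negatives chosen this way are harder to separate from the positive than random documents, which is why they tend to be more informative during finetuning.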

How We Built Nomic Embed

In this blog post, we outline the high-level recipe for building nomic-embed. For further details, please see our technical report.

Training a 2048 Context-Length BERT

To train nomic-embed, we followed a multi-stage contrastive learning pipeline. We start our model from a BERT initialization. Since bert-base only handles context lengths up to 512 tokens, we train our own 2048 context length BERT, nomic-bert-2048.

We make several modifications to our BERT training procedure inspired by MosaicBERT. Namely, we:

Use Rotary Position Embeddings to allow for context length extrapolation.
Use SwiGLU activations, as they have been shown to improve model performance.
Set dropout to 0.
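As an illustration of the first two modifications, the sketch below implements rotary position embeddings and a SwiGLU feed-forward layer in NumPy. It uses the common rotate-half formulation of RoPE with a base of 10000 and omits biases; these details, along with the function names and tensor shapes, are assumptions for illustration and are not taken from the nomic-bert-2048 training code.

```python
import numpy as np


def rotary_embed(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embeddings to x of shape (seq_len, dim), dim even.

    Feature j in the first half is paired with feature j in the second half,
    and each pair is rotated by an angle proportional to the token position.
    """
    seq_len, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-np.arange(half) / half)               # (half,)
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)


def swiglu(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward block: (Swish(x W_gate) * (x W_up)) W_down."""
    gate = x @ w_gate
    swish = gate / (1.0 + np.exp(-gate))  # Swish/SiLU: z * sigmoid(z)
    return (swish * (x @ w_up)) @ w_down


rng = np.random.default_rng(0)
q = np.tile(rng.normal(size=8), (6, 1))  # the same query vector at 6 positions
k = np.tile(rng.normal(size=8), (6, 1))  # the same key vector at 6 positions
rq, rk = rotary_embed(q), rotary_embed(k)

# Attention scores depend only on the relative offset between positions.
print(np.isclose(rq[1] @ rk[3], rq[2] @ rk[4]))

hidden = swiglu(rng.normal(size=(4, 8)),
                rng.normal(size=(8, 16)),
                rng.normal(size=(8, 16)),
                rng.normal(size=(16, 8)))
print(hidden.shape)
```

Because the query-key score depends on the position offset rather than on absolute positions, rotary embeddings are a natural fit when a model needs to handle sequences longer than those seen in training.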

We also implement the following training optimizations:

We train with DeepSpeed and FlashAttention.
We train in BF16 precision.
We increase the vocab size to a multiple of 64.
We train with a batch size of 4096.
During masked language modeling, we mask at a 30% rate instead of 15%.
We do not use the next sentence prediction objective.
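Two of these choices are easy to show in a few lines: rounding the vocabulary size up to a multiple of 64 and masking tokens at a 30% rate. The sketch below is a simplified illustration, not the actual training code. It replaces every selected token with the mask token (skipping BERT's usual 80/10/10 replacement scheme), does not exclude special tokens, and uses -100 as the ignored-label value, a convention borrowed from PyTorch's cross-entropy loss. The mask token id of 103 is an assumption for the example.

```python
import numpy as np


def pad_vocab_size(vocab_size: int, multiple: int = 64) -> int:
    """Round vocab_size up to the nearest multiple (hardware-friendly shapes)."""
    return ((vocab_size + multiple - 1) // multiple) * multiple


def mask_tokens(token_ids, mask_token_id, mask_rate=0.30, rng=None):
    """Return (inputs, labels) for masked language modeling.

    Masked positions get mask_token_id in inputs and the original id in labels;
    all other label positions are set to -100 so a loss function can skip them.
    """
    if rng is None:
        rng = np.random.default_rng()
    token_ids = np.asarray(token_ids)
    mask = rng.random(token_ids.shape) < mask_rate
    inputs = np.where(mask, mask_token_id, token_ids)
    labels = np.where(mask, token_ids, -100)
    return inputs, labels


# bert-base-uncased has 30522 vocabulary entries.
print(pad_vocab_size(30522))  # → 30528

tokens = np.arange(1000, 1020)
inputs, labels = mask_tokens(tokens, mask_token_id=103, rng=np.random.default_rng(0))
print(inputs)
print(labels)
```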

We evaluate the quality of nomic-bert-2048 on the standard GLUE benchmark. We find it performs comparably to other BERT models but with the advantage of a significantly longer context length.

Model	Bsz	Steps	Seq	Avg	Cola	SST2	MRPC	STSB	QQP	MNLI	QNLI	RTE
NomicBERT	4k	100k	2048	0.84	0.50	0.93	0.88	0.90	0.92	0.86	0.92	0.82
RobertaBase	8k	500k	512	0.86	0.64	0.95	0.90	0.91	0.92	0.88	0.93	0.79
JinaBERTBase	4k	100k	512	0.83	0.51	0.95	0.88	0.90	0.81	0.86	0.92	0.79
MosaicBERT	4k	178k	128	0.85	0.59	0.94	0.89	0.90	0.92	0.86	0.91	0.83

Contrastive Training of Nomic Embed

We initialize the training of nomic-embed with nomic-bert-2048. Our contrastive dataset is composed of ~235M text pairs. We extensively validated its quality during collection with Nomic Atlas. You can find dataset details in the nomic-ai/contrastors codebase as well as explore a 5M pair subset in Nomic Atlas.

On the Massive Text Embedding Benchmark (MTEB), nomic-embed outperforms text-embedding-ada-002 and jina-embeddings-v2-base-en.

Name	SeqLen	MTEB	LoCo	Jina Long Context	Open Weights	Open Training Code	Open Data
nomic-embed	8192	62.39	85.53	54.16	✅	✅	✅
jina-embeddings-v2-base-en	8192	60.39	85.45	51.90	✅	❌	❌
text-embedding-3-small	8191	62.26	82.40	58.20	❌	❌	❌
text-embedding-ada-002	8191	60.99	52.7	55.25	❌	❌	❌

Unfortunately, MTEB doesn't evaluate models on long-context tasks. Therefore, we additionally evaluated nomic-embed on the recently released LoCo Benchmark as well as the Jina Long Context Benchmark.

For the LoCo Benchmark, we split evaluations by parameter class and by whether the evaluation is performed in a supervised or unsupervised setting. We bold the top performing model in each split. Nomic Embed is the best performing unsupervised model in the 100M parameter class. Notably, Nomic Embed is competitive both with the top performing models in the 7B parameter class and with models trained in a supervised setting specifically for the LoCo benchmark:

Model	Seq	Param.	Avg	Tau Scrolls Summ.	Tau Scrolls Gov.	Tau Scrolls QMSUM	QASPER - Title to Article	QASPER - Abstract to Article
Unsupervised 100M
Jinav2	8192	137M	85.5	93.3	98.6	40.8	95.1	99.3
nomic-embed	8192	137M	85.53	90.9	97.73	44.2	94.87	99.9
text-embedding-ada-002	8192	N/A	52.7	37.3	44.3	7.30	85.1	89.7
text-embedding-3-small	8192	N/A	82.41	92.16	97.65	27.42	95.9	98.8
Unsupervised 7B
E5 Mistral	4096	7B	87.8	95.9	98.3	46.8	98.4	99.8
Supervised
M2-Bert	2048	80M	83.6	81.8	94.7	58.5	87.3	95.5
M2-Bert	8192	80M	87.9	94.7	96.5	64.1	86.8	97.5

Nomic Embed also outperforms jina-embeddings-v2-base-en in aggregate on the Jina Long Context Benchmark. Unfortunately, Nomic Embed does not outperform OpenAI ada-002 or text-embedding-3-small on this benchmark:

Model	Seq	NarrativeQA	WikiCities	SciFact	BigPatent	Avg.
nomic-embed	8192	37.78	84.26	70.17	24.46	54.16
Jinav2	8192	39.4	75.7	69.4	23.1	51.9
text-embedding-ada-002	8192	41.1	84.7	72.7	22.5	55.3
text-embedding-3-small	8192	47.12	89.9	73.32	22.48	58.2

Overall, Nomic Embed outperforms OpenAI Ada-002 and text-embedding-3-small on two of the three benchmarks.
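The tally can be checked directly against the aggregate scores reported in the summary table earlier in this post. The snippet below is only a small bookkeeping sketch; the helper name benchmarks_won is hypothetical, and a benchmark counts as a win only if Nomic Embed beats both OpenAI models on it.

```python
# Aggregate scores copied from the summary table in this post.
scores = {
    "nomic-embed": {"MTEB": 62.39, "LoCo": 85.53, "Jina Long Context": 54.16},
    "text-embedding-3-small": {"MTEB": 62.26, "LoCo": 82.40, "Jina Long Context": 58.20},
    "text-embedding-ada-002": {"MTEB": 60.99, "LoCo": 52.7, "Jina Long Context": 55.25},
}


def benchmarks_won(model, rivals, table):
    """List the benchmarks on which model scores higher than every rival."""
    return [
        bench
        for bench, value in table[model].items()
        if all(value > table[rival][bench] for rival in rivals)
    ]


wins = benchmarks_won(
    "nomic-embed",
    ["text-embedding-3-small", "text-embedding-ada-002"],
    scores,
)
print(wins)  # → ['MTEB', 'LoCo']
```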

Towards Improved Evaluation of Embedding Models

As models become increasingly complex and benchmarks become increasingly saturated, we will need new paradigms for evaluating our models. We believe that the direct comparison of model embedding spaces can reveal model characteristics that are not captured by benchmarks. As an exercise, we compared the embedding spaces of Nomic Embed and OpenAI Ada on a 250K sample of English Wikipedia.

This is the embedding space of the Wikipedia articles using OpenAI's Ada model. Each point represents an article, and its color represents its topic in a hierarchical topic model.

The topic model assigning colors to points can be viewed at its broad or specific level. Currently, we are looking at broad topics. We can see that the Ada model is decent at arranging points by these broad topics: colors are generally grouped together.

But when we change the colors to reflect the specific topic of each point in the topic model (more granular), we no longer see as much coherent grouping of points by color.

Now, we are looking at the embedding space of the Wikipedia articles using Nomic Embed Text.

Toggling between the two spaces reveals structural differences between the representations built by the two models.

We show this on a specific cluster by filtering for the text "may refer to", which will match all the Wikipedia disambiguation articles.

These articles all get very similar representations using Nomic Embed Text: it opts to cluster these articles together in an island on the bottom right of the map.

Now let's toggle back to view the articles in the Ada embedding space, while keeping our previous filter activated.

We see now that, instead of representing disambiguation articles similarly, Ada opts to sort these disambiguation articles into disparate regions of the embedding space.

The models' different treatment of disambiguation articles is a systematic difference in their behaviors that is not captured by the benchmarks. This underscores the importance of developing new evaluation paradigms that can fully capture the differences between models.
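One way to put a number on this kind of difference, offered here as a rough sketch and not as the method behind the Atlas maps, is to compare each item's nearest neighbors in the two embedding spaces. If two models organize the data similarly, an item's neighbors under one model should largely match its neighbors under the other. The function names and the random stand-in data below are assumptions for illustration.

```python
import numpy as np


def knn_indices(emb: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k nearest neighbors (by cosine similarity) of each row."""
    x = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = x @ x.T
    np.fill_diagonal(sims, -np.inf)  # an item is not its own neighbor
    return np.argsort(-sims, axis=1)[:, :k]


def neighborhood_overlap(emb_a: np.ndarray, emb_b: np.ndarray, k: int = 10) -> np.ndarray:
    """Per-item fraction of shared k-nearest neighbors across two embedding spaces.

    Row i of emb_a and row i of emb_b must embed the same item; the two spaces
    may have different dimensionality.
    """
    nn_a, nn_b = knn_indices(emb_a, k), knn_indices(emb_b, k)
    return np.array([len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)])


rng = np.random.default_rng(0)
space_a = rng.normal(size=(50, 8))    # stand-in for one model's embeddings
space_b = rng.normal(size=(50, 12))   # stand-in for an unrelated model

print(neighborhood_overlap(space_a, space_a, k=5).mean())  # identical spaces
print(neighborhood_overlap(space_a, space_b, k=5).mean())  # unrelated spaces
```

Restricting the comparison to a filtered subset, such as the disambiguation articles, would show whether that subset's neighborhoods differ more between models than the dataset as a whole.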

Nomic Embedding API and Atlas Enterprise

We release the Nomic Embed model weights and full training data for complete model auditability. Nomic recognizes that enterprises require fully auditable AI, and we're proud to offer the first performant text embedding model that can achieve it. Contact Nomic to learn about Nomic Atlas Enterprise.

The best way to use Nomic Embed is through our production-ready Nomic Embedding API.

You can access the API via HTTP and your Nomic API Key:

curl https://api-atlas.nomic.ai/v1/embedding/text \
    -H "Authorization: Bearer $NOMIC_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{ "model": "nomic-embed-text-v1",
          "texts": ["Nomic AI introduces Nomic Embed", "#keepAIOpen"]}'

You can also use the official Nomic Python client after running pip install nomic:

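from nomic import embed
import numpy as np

# Embed two texts with the hosted nomic-embed-text-v1 model.
output = embed.text(
    texts=[
        "Who is Laurens van der Maaten?",
        "What is dimensionality reduction?",
    ],
    model='nomic-embed-text-v1',
)

# Token usage reported for this request.
print(output['usage'])

# Stack the returned embeddings into an array with one row per input text.
embeddings = np.array(output['embeddings'])

print(embeddings.shape)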
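Where installing the client is not an option, the HTTP call from the curl example can be sketched with Python's standard library alone. The URL, headers, and payload fields below are copied from that curl command; the function names are hypothetical, and the shape of the JSON response is not documented in this post, so the sketch returns the decoded body as-is without assuming its keys.

```python
import json
import urllib.request

# Endpoint copied from the curl example in this post.
NOMIC_EMBED_URL = "https://api-atlas.nomic.ai/v1/embedding/text"


def build_embedding_request(texts, api_key, model="nomic-embed-text-v1"):
    """Build (but do not send) a POST request equivalent to the curl call."""
    payload = json.dumps({"model": model, "texts": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        NOMIC_EMBED_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def embed_texts(texts, api_key, model="nomic-embed-text-v1", timeout=30.0):
    """Send the request and return the decoded JSON body as a dict."""
    request = build_embedding_request(texts, api_key, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling embed_texts with a list of strings and the value of the NOMIC_API_KEY environment variable would send the request; build_embedding_request is kept separate so the request can be inspected without network access.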

Nomic Embed on AWS Marketplace

In addition to using our hosted inference API, you can purchase dedicated inference endpoints on the AWS Marketplace. Please contact sales@nomic.ai with any questions.

Data Access

To access the full data, we provide Cloudflare R2 access keys to the buckets containing the data. To get access, create a Nomic Atlas account and follow the instructions in the contrastors repo.

Nomic asks that if you want to use a public inference service for accessing Nomic Embed, you choose the Atlas Embedding API. This allows Nomic to continue driving future open-source AI innovation. Remember, you can always access and run the model without usage restrictions by simply downloading the open-source model weights.