
Authors: Zach Nussbaum, Principal MLE, and Max Cembalest, Developer Advocate

Nomic Embed's Surprisingly Good MTEB Arena Elo Score

Nomic Embed performance on MTEB Arena vs parameter count

How is Nomic Embed scoring as well as models 70x its size on the MTEB Arena leaderboard on HuggingFace?

Are large embedding models overfit to benchmarks? Do those benchmarks reflect user preference? Is evaluation really broken, or is it just not reflecting the whole picture?

Arena-Style Retrieval Evaluation for Embedding Models

The new Arena for embedding models on HuggingFace is a blind leaderboard aimed at providing a dynamic, real-world assessment of embedding model capabilities.

Users submit queries, see retrieved texts from a pair of anonymously selected embedding models, and vote on which of the two models retrieved the best response to the query. This process is very similar to the LMSYS Chatbot Arena, which conducts blind side-by-side preference voting for LLM responses instead of retrieval results.

Voting results are aggregated into an Elo score and plotted along the y-axis of the chart above. Nomic Embed scores alongside the strongest embedding models.

The arena-style voting paradigm, while conditional on the datasets being used (Wikipedia, Arxiv, and Stackexchange for the MTEB Arena), is showing early signs that a model as compact as Nomic Embed can be just as performant at retrieval as models many times its size.

How Arena-Style Evaluation Changes The Paradigm

Arena-style retrieval evaluation differs from the standard paradigm that has defined how embedding models have been benchmarked for the last few years.

Historically, embedding models have been primarily evaluated using static benchmark datasets like MTEB (Massive Text Embedding Benchmark) and BEIR. While these scores provide valuable baselines, static benchmarks can be overfit: models end up optimized for specific test sets rather than for real-world performance. Many models today have been trained on the benchmarks' training sets and/or on synthetic data that closely resembles the benchmark data.

This new paradigm focuses on user preference as the key driver behind model performance, as opposed to performance measured by static benchmark tests. This aligns with our general vision at Nomic that users getting value out of the representations provided by our embedding models matters a lot more than how high our model scores at retrieval on the MTEB dataset.

The existence of static leaderboards has also yielded many benefits for the open source community: the structure provided and competition generated by leaderboards, such as those on HuggingFace and Kaggle, have played an enormous role in making viable, cost-effective model development strategies more accessible to the public.

But the arena approach offers a more dynamic, user-driven, real-world assessment of embedding model capabilities.

The metric used to aggregate performance in an arena is an Elo score, the rating system originally devised to rank the relative strength of chess players (and famously borrowed by Mark Zuckerberg to rank things in college).
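
For readers unfamiliar with Elo, here is a minimal sketch of the standard update rule: each model starts with a rating, and after every head-to-head vote the winner gains and the loser loses points in proportion to how surprising the outcome was. The constants below (a base rating of 1000 and K = 32) are illustrative defaults, not the exact parameters used by MTEB Arena.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one head-to-head vote."""
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (s_a - e_a)
    new_b = rating_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return new_a, new_b

# Example: two models start at 1000; the first one wins the vote.
print(update_elo(1000, 1000, a_won=True))  # -> (1016.0, 984.0)
```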


Nomic Embed: Performance & Efficiency

Nomic Embed is a long-context embedding model we released in February 2024. At the time, it was the first long-context embedding model to outperform OpenAI's text-embedding-ada-002 and text-embedding-3-small on both short and long context tasks.

We know that Nomic Embed is one of the best embedding models out there, even though we trained it not only for performance but also for compactness and speed. Still, we were naturally a bit nervous when we saw an arena-style competition go up: the LLM arena has been hugely influential (some people claim damagingly so!), and although we knew we had built a solid model, would it get swamped by larger models?

What we found instead was that the MTEB Arena validated what we were hearing from users of our embedding model: Nomic Embed punches above its weight and hits the sweet spot between performance and model size.

Nomic Embed performance on MTEB Arena vs parameter count

We've plotted Nomic Embed's Elo score against its parameter count alongside other models from the Arena. As you can see in the top-left corner of the chart, Nomic Embed is currently the optimal choice when viewed from this perspective.

Parameter count matters because the higher the parameter count, the more memory and time required to use the embedding model.
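
To give a rough sense of what that means in practice, here is a back-of-the-envelope memory estimate assuming fp16 weights (2 bytes per parameter); the parameter counts are approximate and used only for illustration.

```python
def fp16_weight_memory_gb(num_params: float) -> float:
    """Approximate memory needed just to hold the model weights in fp16."""
    return num_params * 2 / 1e9  # 2 bytes per parameter

# Nomic Embed (~137M parameters) vs. a hypothetical 7B-parameter embedding model.
print(f"{fp16_weight_memory_gb(137e6):.2f} GB")  # ~0.27 GB
print(f"{fp16_weight_memory_gb(7e9):.2f} GB")    # ~14.00 GB
```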

To make this efficiency more concrete, here is a time comparison of the embedding models, showing that our smaller parameter count has the extremely practical benefit of making our model orders of magnitude faster to run than larger models.

With Nomic Embed, you can get closer to the sweet spot of performance & efficiency.

Nomic Embed performance on MTEB Arena vs time

Compact, Open Source, SOTA model

Our aim is for Nomic Embed to be the best model in terms of performance, cost, speed, and memory. We open sourced our training data and training code, showing how we built a state-of-the-art embedding model that runs efficiently on consumer hardware while matching the benchmark scores of models from other top AI labs.

As a result of our commitment to cost-effectiveness, our model is quite compact: it doesn't require as many parameters to map out concepts in vector space. That makes it faster to download and easier to work with overall, all while remaining state of the art at mapping the semantics of data.

Explaining Nomic Embed's Performance on Static Leaderboard vs. Dynamic Arena

On the static MTEB Leaderboard, Nomic Embed ranks in the top 50s. However, on MTEB Arena, Nomic Embed ranks similarly to top-10 MTEB Leaderboard models that are 70x bigger.

The performance gap between Nomic Embed's static MTEB Leaderboard and dynamic MTEB Arena results raises an important question: Are larger models overfitting the MTEB benchmark? Is a higher MTEB score a measure of better real-world performance?

Many newer models are trained on the BEIR training sets and on synthetic data generated to resemble MTEB training data. Although the larger embedding models show improvements on the static MTEB and BEIR datasets, the small gap between Nomic Embed and the larger models on the Arena suggests that higher static MTEB Leaderboard scores may not fully capture a model's real-world performance.

The Sweet Spot

The role that Nomic Embed plays in our vision for Nomic's future is multifaceted. No single score can capture what it means to get strong performance on all of these use cases at once, let alone what it means to get the right balance with speed, cost, and memory.

Nomic Embed provides the multimodal vector space used by our Nomic Atlas product, allowing anyone to understand the semantic contents of massive text & image datasets in minutes.

Data retrieval in GPT4All uses Nomic Embed, bringing semantically relevant document snippets into LLM chats quickly, privately, on-device, for free.

Nomic Embed was also trained to support common ML developer tasks, namely clustering, semantic classification, and query -> document RAG retrieval, via task-specific prefixes baked into the model weights, as shown in the sketch below.
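
As a concrete illustration, here is a minimal sketch of how those prefixes are used when loading the model from HuggingFace with sentence-transformers; the example texts are made up, and the exact loading options may differ depending on your setup.

```python
from sentence_transformers import SentenceTransformer

# nomic-embed-text-v1.5 expects a task prefix on every input string.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

# Query -> document retrieval: queries and documents get different prefixes.
query_embedding = model.encode("search_query: who founded Nomic?")
doc_embeddings = model.encode([
    "search_document: Nomic builds tools for understanding unstructured data.",
    "search_document: Elo ratings are used to rank chess players.",
])

# Clustering and classification use their own prefixes.
cluster_embeddings = model.encode(["clustering: customer support tickets about billing"])
label_embeddings = model.encode(["classification: this product review is positive"])
```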

Instead of aiming for a better score on a single benchmark, we view our goal as hitting the "sweet spot" across all our intended use cases: learning concepts at the right scale, with the right number of parameters, trained on the right data. This requires getting many different things right all at once so that the total experience of using Nomic's models and software is the best it can be. We want the essence of using Nomic to be this: understand what is in your data as quickly as possible to build whatever it is you need to build. This philosophy drills all the way down into our vector representations: are they as complex as they need to be?

Even though these aims are hard to quantify, we are excited to see ongoing innovation in the evaluation space, and we believe that benchmarking & standards-setting, while difficult to get perfect, are extremely important for moving in the right direction. Our aim is to keep moving our own models toward that sweet spot.

Chart Crimes

It's risky to make strong claims about any model - including Nomic Embed - from charts that compare benchmark scores.

For one, the MTEB Arena scores use Wikipedia, Arxiv, & Stackexchange data. If your use case involves different kinds of documents, your application performance may not strongly correlate with MTEB Arena Elo.

But more importantly, anyone can make a chart with an x-axis and a y-axis of their choosing. It takes more than optimizing public benchmarks to make the right model.

Using Nomic Embed

The easiest way to get embeddings with Nomic Embed is using the Nomic Python client, either running locally for free on your device, or running remotely via the Nomic API.
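
For example, here is a minimal sketch using the Nomic Python client; the exact arguments may vary with client version, and the texts are placeholders.

```python
from nomic import embed

# Remote inference via the Nomic API (requires logging in with an API key);
# switch inference_mode to "local" to run the model on your own device.
output = embed.text(
    texts=["Nomic Embed hits the sweet spot between performance and size."],
    model="nomic-embed-text-v1.5",
    task_type="search_document",
    inference_mode="remote",
)

embeddings = output["embeddings"]
print(len(embeddings), len(embeddings[0]))
```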

You can see detailed information about how to use our embedding models on HuggingFace.

Further instructions for getting embeddings via the Nomic Python client can be found in our documentation.
