
Author: Nomic Team

Local Nomic Embed: Run OpenAI Quality Text Embeddings Locally

On February 1st, 2024, we released Nomic Embed - a truly open, auditable, and highly performant text embedding model. Since this release, we've been excited to see this model adopted by our customers, inference providers, and top ML organizations, with trillions of tokens per day running through Nomic Embed.

At Nomic we understand that engineering teams value flexibility and optionality in their machine learning stack.

We're excited to introduce an officially supported, fully local version of Nomic Embed, powered by GPT4All local inference, that exactly matches our remote inference API. Just specify inference_mode='local' and you're set.

pip install nomic

from nomic import embed
import numpy as np

# Embed two documents with the local model; the model downloads automatically on first use.
output = embed.text(
    texts=[
        'Nomic Embed now supports local and dynamic inference to save you inference latency and cost!',
        'Hey Nomic, why don\'t you release a multimodal model soon?',
    ],
    model='nomic-embed-text-v1.5',
    task_type="search_document",
    inference_mode='local',
    dimensionality=768,
)

print(output['usage'])

embeddings = np.array(output['embeddings'])

print(embeddings.shape)

Dynamic Inference Mode: Remote and Local

Nomic Embed dynamic inference switches between the local model instance and a remote API based on statistics of your input text.

This allows small inputs to run on your local machine (e.g. a MacBook or EC2 instance) while larger queries dynamically route to the remotely hosted Nomic Embedding API.

This new capability allows engineers to walk the Pareto frontier of embedding latency and cost. For short text inputs, the network round trip (100-200ms) to a hosted API is slower than running inference locally; in this case, dynamic mode simply runs inference locally, free of charge.

For queries with a long sequence length or a large batch size, remote inference is faster, so dynamic inference will route them to the Nomic Embedding API.
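
A minimal sketch of dynamic routing with the same embed.text call, assuming the mode is selected with inference_mode='dynamic' (check the docs for the exact value):

from nomic import embed

# Assumed usage: inference_mode='dynamic' lets the client decide whether this
# request runs on-device or against the hosted Nomic Embedding API.
output = embed.text(
    texts=['A short query that can stay on-device.'],
    model='nomic-embed-text-v1.5',
    task_type='search_query',
    inference_mode='dynamic',
    dimensionality=768,
)

print(output['usage'])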

How does local and dynamic inference mode work?

Local inference runs a GGML graph of Nomic Embed via GPT4All. When you request local inference, the model is automatically downloaded to your machine and used for embed.text requests.
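
Under the hood, this is roughly equivalent to embedding with GPT4All's Embed4All directly. A minimal sketch, assuming a Nomic Embed GGUF build is available under this filename in your GPT4All version:

from gpt4all import Embed4All

# Assumption: 'nomic-embed-text-v1.5.f16.gguf' is the GGUF filename shipped by
# GPT4All; the exact name may differ between releases.
embedder = Embed4All('nomic-embed-text-v1.5.f16.gguf')

vector = embedder.embed('Nomic Embed runs locally via GPT4All.')
print(len(vector))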

Dynamic mode switches between local and remote API mode with the objective of saving inference latency and cost. For example, on a 2023 MacBook Pro (16GB), local mode is faster than remote inference (assuming 200ms network latency) for inputs up to 1024 tokens long.

What software, models and hardware are supported?

Currently, only the Nomic Python bindings to the Nomic Embedding API support local mode.

All current Nomic Embed models are supported, including nomic-embed-text-v1 and nomic-embed-text-v1.5 with binary and resizable embeddings.

Local inference mode supports any CPU or GPU that GPT4All supports, including Apple Silicon (Metal), NVIDIA GPUs, and discrete AMD GPUs.
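
To target a specific accelerator in local mode, pass the device parameter to embed.text; the 'gpu' value here mirrors the LangChain and LlamaIndex examples below:

from nomic import embed

# Local inference pinned to a GPU supported by GPT4All; device works the same
# way in the LangChain and LlamaIndex integrations shown later.
output = embed.text(
    texts=['Local inference on a GPU supported by GPT4All.'],
    model='nomic-embed-text-v1.5',
    task_type='search_document',
    inference_mode='local',
    device='gpu',
)

print(output['usage'])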

How fast is local mode?

We benchmark Nomic Embed in local inference mode across a variety of CPUs and GPUs with a batch size of one. It's usually preferable to use remote mode with batch sizes larger than one. The chart below was constructed by sampling inference latency on the target hardware devices supported by GPT4All.

[Chart: Local Nomic Embed benchmarking across supported hardware]

In general, you should prefer to use Nomic Embed local mode in any configuration below the Nomic API curve. Dynamic inference mode will automatically detect these cases and switch to the local model.

Note that local inference latency on Mac Metal is quite competitive at long sequence lengths.

Can I use local mode with LangChain or LlamaIndex?

When using the LangChain or LlamaIndex integrations, the inference_mode and device parameters work the same as with embed.text. For LangChain:

from langchain_nomic import NomicEmbeddings

embeddings = NomicEmbeddings(
    model='nomic-embed-text-v1.5',
    inference_mode='local',
    device='gpu',
)

result = embeddings.embed_documents(['text to embed'])

For LlamaIndex:

from llama_index.embeddings.nomic import NomicEmbedding

embed_model = NomicEmbedding(
    model_name='nomic-embed-text-v1.5',
    inference_mode='local',
    device='gpu',
)

result = embed_model.get_text_embedding('text to embed')


Learn more about how to use local and dynamic Nomic Embed inference in the official documentation.
