We're excited to announce the release of Nomic Embed Multimodal, a suite of models that achieve state-of-the-art performance in embedding PDFs, images, papers, and charts.
This release includes four models, available in two sizes (3B and 7B parameters) and two variants:
Our best model, ColNomic Embed Multimodal 7B, achieves 62.7 NDCG@5 on Vidore-v2, a visual document retrieval benchmark focused on page-level retrieval — a +2.8 point improvement over the previous state of the art. Additionally, Nomic Embed Multimodal 7B outperforms all other single-vector models on the benchmark.
Documents are visually rich structures that convey information not just through text, but through figures, page layouts, tables, and even fonts. Traditional retrieval systems primarily rely on extracted text, missing these crucial visual elements and often requiring complex, error-prone processing pipelines. Nomic Embed Multimodal, inspired by ColPali and DSE, solves this problem by supporting interleaved text and image inputs, making it ideal for:
Before multimodal embedding models, representing multimodal data for retrieval required:
In contrast, VLMs provide a simple and accurate way to embed image and text data with a single model, eliminating these complexities while improving performance. This approach delivers superior accuracy compared to text-only approaches that require OCR, while being faster than complex pipelines with multiple processing steps. It also provides more comprehensive capture of visual information by directly processing images alongside related text.
Both the ColPali and DSE approaches significantly outperform multimodal models with CLIP-style architectures by addressing the modality gap through interleaved processing of text and images.
The ColNomic Embed Multimodal models use a multi-vector late interaction mechanism. Instead of creating one embedding per document or query, ColNomic creates multiple embeddings. This allows for more precise matching during retrieval and leads to better performance.
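To make the late-interaction idea concrete, here is a minimal sketch of MaxSim scoring, the mechanism ColBERT-style multi-vector models use: each query token embedding is matched against its best-scoring document token embedding, and the per-token maxima are summed. The toy vectors below are illustrative, not output from the actual model.

```python
# Late-interaction (MaxSim) scoring over toy token embeddings.
# Each query token finds its best match among document tokens;
# the per-token maxima are summed into one relevance score.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """Sum over query tokens of the max similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy example: 2 query-token vectors, 3 document-token vectors.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]   # covers both query tokens
doc_b = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]   # only partial matches

print(maxsim_score(query, doc_a))  # 2.0
print(maxsim_score(query, doc_b))  # 1.0
```

Because each query token is matched independently, a document page can score well even when the relevant evidence is scattered across different regions — something a single pooled vector tends to blur.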
Building on these advances, we applied our learnings from training high-performance text embeddings to create even better multimodal embeddings. Starting with Qwen2.5-VL 3B Instruct as our baseline, we implemented several key improvements:
We discovered that naive sampling across dataset sources allows models to learn shortcuts rather than semantic relationships. By forcing sampling from the same source, we create harder in-batch negatives that prevent the model from "cheating" and improve its understanding of content relationships.
Result: +2.9 point improvement on Vidore-v2 NDCG@5
We trained an initial dense model on the ColPali training dataset and VDR Multilingual Train Dataset, then used it to retrieve the top-k nearest neighbors for each query.
Additionally, we reduced false negatives using positive-aware hard negative mining, a technique first introduced in NV-Retriever.
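The filtering step can be sketched as follows: candidates that score nearly as high as the labeled positive are likely relevant themselves (false negatives) and are discarded before the remaining top scorers are kept as hard negatives. The 95% ceiling below is an illustrative choice, not Nomic's actual setting.

```python
# Positive-aware hard-negative filtering in the spirit of NV-Retriever:
# a candidate scoring above a margin relative to the positive is treated
# as a likely false negative and dropped. The margin value is assumed.

def mine_hard_negatives(query_scores, positive_id, margin=0.95, top_k=5):
    """query_scores: dict of doc_id -> retrieval score for one query."""
    ceiling = margin * query_scores[positive_id]
    candidates = [
        (doc_id, s) for doc_id, s in query_scores.items()
        if doc_id != positive_id and s < ceiling  # drop likely false negatives
    ]
    candidates.sort(key=lambda x: -x[1])  # hardest (highest-scoring) first
    return [doc_id for doc_id, _ in candidates[:top_k]]

scores = {"pos": 0.90, "d1": 0.89, "d2": 0.70, "d3": 0.40, "d4": 0.10}
print(mine_hard_negatives(scores, "pos", top_k=2))  # ['d2', 'd3']
```

Note that `d1` is excluded even though it is the hardest candidate: at 0.89 it sits above 95% of the positive's 0.90 score, so it is more likely an unlabeled relevant document than a useful negative.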
Results:
VLMs like Nomic Embed Multimodal simplify how RAG systems handle documents with rich visual content. Documents with equations, diagrams, charts, and tables provide essential context alongside the text.
Technical documentation presents similar challenges - code blocks, flowcharts, and screenshots need to be understood together with their surrounding text. The same applies to product catalogs with specifications and images, or financial reports containing charts and numerical data.
By embedding visual and textual content together, retrieval becomes more accurate and integrations into real systems become much easier to implement and experiment with. Removing preprocessing steps often makes indexing faster and reduces complexity, and the single API for both images and text keeps implementations straightforward.
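As a sketch of how simple the retrieval side becomes once pages are embedded directly, here is a minimal single-vector retrieval loop over a precomputed index. The embeddings are toy 2-D stand-ins for what a model like Nomic Embed Multimodal would produce; the `top_k` helper and index layout are illustrative, not an official API.

```python
# Minimal single-vector retrieval over precomputed page embeddings:
# rank pages by cosine similarity to the query embedding.
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def top_k(query_vec, index, k=3):
    """index: list of (page_id, embedding) pairs."""
    ranked = sorted(index, key=lambda p: -cosine(query_vec, p[1]))
    return [page_id for page_id, _ in ranked[:k]]

# Toy index of three embedded pages.
index = [
    ("page-1", [0.9, 0.1]),
    ("page-2", [0.2, 0.8]),
    ("page-3", [0.7, 0.3]),
]
print(top_k([1.0, 0.0], index, k=2))  # ['page-1', 'page-3']
```

The retrieved page IDs (or the page images themselves) can then be passed straight to a generator model, with no OCR or layout-parsing stage in between.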
Nomic Embed Multimodal offers state-of-the-art performance while substantially simplifying the retrieval pipeline. As part of the broader Nomic Embed Ecosystem, this technology demonstrates our commitment to pushing the boundaries of embedding capabilities. To learn more about the complete ecosystem, see our detailed blog post.
You can get started with our new model collection on Hugging Face:
And here are some guides demonstrating how to use the new models as retrievers for RAG workflows: