Embeddings
Training Sparse MoE Text Embedding Models
Introduces the first general-purpose mixture-of-experts text embedding model, which achieves state-of-the-art performance on the MIRACL benchmark. The model is truly open source: the training data, weights, and code are all available and permissively licensed.
CoRNStack: High-Quality Contrastive Data for Better Code Ranking
An open dataset for training state-of-the-art code embedding models. Work done in collaboration with the University of Illinois at Urbana-Champaign.
Nomic Embed: Training a Reproducible Long Context Text Embedder
The first truly open (i.e. open data, weights, and code) text embedding model that outperforms OpenAI Ada. Work done in collaboration with Cornell University.
Nomic Embed Vision: Expanding the Latent Space
The first multimodal embedding model to achieve high performance on text-text, text-image, and image-image tasks with a single unified latent space.
Embedding Based Inference on Generative Models
An extension of Data Kernel methods to black box settings. Work done in collaboration with Johns Hopkins University.
Language Models
Tracking the Perspectives of Interacting Language Models
Developing and studying metrics for understanding information diffusion in communication networks of LLMs. Work done in collaboration with Johns Hopkins University.
GPT4All: An Ecosystem of Open Source Compressed Language Models
How the first open source LLM to surpass GPT-3.5's performance grew from a model into a movement. Work done in collaboration with the GPT4All community.
Comparing Foundation Models using Data Kernels
A method for statistically rigorous comparison of embedding spaces without labeled data. Work done in collaboration with Johns Hopkins University.
Dimensionality Reduction
HUMAP: Hierarchical Uniform Manifold Approximation and Projection
A hierarchical and flexible dimensionality reduction algorithm that preserves local and global structures for accurate, efficient data exploration. Work done in collaboration with São Paulo State University (Brazil), Eindhoven University of Technology (the Netherlands), and Linnaeus University (Sweden).
The Landscape of Biomedical Research
The first systematic study of the entirety of PubMed from an information cartography perspective. Work done in collaboration with the University of Tübingen.
Mapping Wikipedia with BERT and UMAP
The first systematic study of the entirety of English Wikipedia from an information cartography perspective. Work done in collaboration with New York University.

