If you haven't read our previous posts on data maps and embeddings, we recommend starting there for some useful background!
In our previous posts, we explored how data maps help visualize relationships in your data and how embeddings transform raw data into a powerful numerical format that captures semantic meaning. Through this transformation, each document, image, or piece of content becomes a vector with hundreds or thousands of dimensions.
This leads us to a fundamental challenge: how do we visualize these rich, high-dimensional embeddings in a way humans can understand? While these dense vectors are ideal for computers—allowing us to efficiently store and operate on vast amounts of semantic information—they are impossible for humans to visualize directly.
Dimensionality reduction re-represents high-dimensional data in a much lower-dimensional space while preserving its meaningful relationships. Points that are similar in the original high-dimensional space should remain close together in the visualization, while dissimilar points should stay far apart. This preservation of relationships is essential for creating meaningful visualizations that humans can interpret and trust.
Consider the centuries-old challenge of geographic cartography: our planet exists in three dimensions, but to represent it on a map we need to encode three-dimensional information on a two-dimensional surface. Cartographers developed various map projections, each preserving different aspects of Earth's geography—some maintaining accurate areas, others preserving angles or distances.
Image from Bellerby & Co Globemakers
Dimensionality reduction algorithms applied to hundred- or thousand-dimensional data face a similar challenge, but at a much larger scale. Instead of reducing from three dimensions to two, these algorithms must preserve the essential relationships that exist in hundreds or thousands of dimensions while creating a 2- or 3-dimensional representation that human eyes can comprehend. Just as different map projections serve different purposes, various dimensionality reduction techniques make different trade-offs in how they preserve high-dimensional relationships in their two-dimensional representations.
Dimensionality reduction is fundamentally about making tradeoffs. No single method is perfect; each technique sacrifices some information to reveal high-dimensional patterns in a lower-dimensional space. Consequently, every approach has its own strengths and weaknesses.
To see how dimensionality reduction works on high-dimensional data, we'll show what it looks like to apply it to the MNIST dataset of handwritten digits.
Examples of handwritten digits from the MNIST dataset
Each image is 28x28 pixels, meaning each item in this dataset is 784-dimensional. By using dimensionality reduction, we can get a view of all 70,000 different images in this dataset on one 2D data map.
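To make this concrete, here's a minimal sketch of how you might load MNIST as flat 784-dimensional vectors in Python. We use scikit-learn's fetch_openml helper here purely for illustration; it's one of several ways to obtain the dataset, and the variable names X and y are our own.

```python
from sklearn.datasets import fetch_openml

# Each 28x28 image arrives already flattened into a 784-dimensional row vector.
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X, y = mnist.data, mnist.target

print(X.shape)  # (70000, 784): 70,000 images, 784 dimensions each
```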
When a dimensionality reduction algorithm is working well, we should generally see similar items cluster together - meaning the map should group the 0s together, the 1s together, and so on. Additionally, we'll show that Nomic Project (our dimensionality reduction algorithm) reveals other qualitative aspects of this dataset!
We've applied Principal Component Analysis (PCA) for dimensionality reduction. While it's computationally efficient - taking only ~2 seconds to compute on this dataset - the results are not ideal: the different digit classes overlap heavily, making it difficult to distinguish one class from another.
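For reference, a PCA reduction like this one can be sketched in a few lines with scikit-learn. This assumes the X and y arrays loaded above; exact timings will vary by machine, and the plotting choices are just one way to view the result.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the 784-dimensional vectors onto their top two principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Color each point by its digit label to see how much the classes overlap.
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y.astype(int), cmap="tab10", s=1)
plt.colorbar(label="digit")
plt.show()
```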
t-SNE is much better than PCA at preserving local structure in the data, meaning it keeps similar points close together while separating dissimilar points. Unlike PCA which is a linear projection, t-SNE can capture non-linear relationships in the data, allowing it to find more complex patterns and create more meaningful clusters.
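A comparable t-SNE reduction is available in scikit-learn. This is a rough sketch under the same assumptions as above: the parameter values are illustrative defaults rather than tuned settings, and on all 70,000 points the run can take many minutes.

```python
from sklearn.manifold import TSNE

# Non-linear reduction to 2D; perplexity controls the size of the local
# neighborhood each point tries to preserve.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=42)
X_tsne = tsne.fit_transform(X)

print(X_tsne.shape)  # (70000, 2)
```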
In this view, we're using UMAP for dimensionality reduction, which was developed by Leland McInnes, John Healy, and James Melville in 2018.
UMAP is faster than t-SNE, and while t-SNE excels at preserving local structure (keeping similar points together), UMAP is better at maintaining both local and global relationships. This means UMAP not only creates clear clusters but also better preserves the overall structure and relationships between different regions of a dataset.
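The equivalent UMAP reduction uses the umap-learn package. Again, this is a minimal sketch on the same X array, with n_neighbors and min_dist set to their common defaults rather than values tuned for this dataset.

```python
import umap  # pip install umap-learn

# n_neighbors balances local vs. global structure; min_dist controls how
# tightly points are packed together in the 2D layout.
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_umap = reducer.fit_transform(X)

print(X_umap.shape)  # (70000, 2)
```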
Finally, we're looking at the data through Nomic Project, our own dimensionality reduction algorithm. Like UMAP and t-SNE, it creates clear clusters while preserving meaningful structure, but with a key advantage: computational efficiency. The algorithm scales nearly linearly with data size, unlike t-SNE and UMAP, which scale at O(n log n). This efficiency lets us create high-quality data maps for our users even on extremely large datasets.
We believe dimensionality reduction is a fundamentally important building block: it's an area of fundamental research upstream of nearly everything in AI. Our research focuses on making it more accurate and scalable with Nomic Project, the dimensionality reduction algorithm that Atlas runs, which builds on the foundations laid by t-SNE and UMAP. We're particularly focused on improving how we approximate relationships in the reduced space, allowing us to efficiently process millions of data points while maintaining the integrity of their relationships.
Our development benefits from direct collaboration with the very pioneers who revolutionized this field. Our advisors include Dr. Laurens van der Maaten, co-creator of t-SNE, and Dr. Leland McInnes, co-creator of UMAP—bringing their deep expertise and insights to help guide our advances in dimensionality reduction.
Why are we improving dimensionality reduction at the algorithmic level? We want to create more accurate data maps that better reflect your actual data. We want to give more people the ability to work with data maps of larger datasets. And we want faster processing that spins up data maps more quickly for our users. In short, we want to make data mapping more capable and accessible for everyone!