If you haven't read our previous posts on data maps and embeddings, we recommend starting there for some useful background!
In our previous posts, we explored how data maps help visualize relationships in your data and how embeddings transform raw data into a powerful numerical format that captures semantic meaning. Through this transformation, each document, image, or piece of content becomes a vector with hundreds or thousands of dimensions.
This leads us to a fundamental challenge: how do we visualize these rich, high-dimensional embeddings in a way humans can understand? While these dense vectors are ideal for computers—allowing us to efficiently store and operate on vast amounts of semantic information—they are impossible for humans to visualize directly.
Dimensionality reduction re-represents high-dimensional data in a much lower-dimensional space while preserving its meaningful relationships. Points that are similar in the original high-dimensional space should remain close together in the visualization, while dissimilar points should stay far apart. This preservation of relationships is essential for creating meaningful visualizations that humans can interpret and trust.
Consider the centuries-old challenge of geographic cartography: our planet exists in three dimensions, but to represent it on a map we need to encode three-dimensional information on a two-dimensional surface. Cartographers developed various map projections, each preserving different aspects of Earth's geography—some maintaining accurate areas, others preserving angles or distances.
Image from Bellerby & Co Globemakers
Dimensionality reduction algorithms applied to hundred- or thousand-dimensional data face a similar challenge, but at a much larger scale. Instead of reducing from three dimensions to two, these algorithms must preserve the essential relationships that exist in hundreds or thousands of dimensions while creating a 2- or 3-dimensional representation that human eyes can comprehend. Just as different map projections serve different purposes, various dimensionality reduction techniques make different trade-offs in how they preserve high-dimensional relationships in their two-dimensional representations.
Dimensionality reduction is fundamentally about making tradeoffs. No single method is perfect; each technique sacrifices some information to reveal high-dimensional patterns in a lower-dimensional space. Consequently, every approach has its own strengths and weaknesses.
To see how dimensionality reduction works on high-dimensional data, we'll show what it looks like to apply it to the MNIST dataset of handwritten digits.
Examples of handwritten digits from the MNIST dataset
Each image is 28x28 pixels, meaning each item in this dataset is 784-dimensional. By using dimensionality reduction, we can get a view of all 70,000 different images in this dataset on one 2D data map.
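To make this concrete, here's a minimal sketch of how you might load MNIST as flat 784-dimensional vectors in Python. We use scikit-learn's fetch_openml helper here purely for illustration; it's one of several ways to obtain the dataset, and the variable names X and y are our own.

```python
from sklearn.datasets import fetch_openml

# Each 28x28 image arrives already flattened into a 784-dimensional row vector.
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X, y = mnist.data, mnist.target

print(X.shape)  # (70000, 784): 70,000 images, 784 dimensions each
```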
When a dimensionality reduction algorithm is working well, we should generally see similar items cluster together - meaning the map should group the 0s together, the 1s together, and so on. Additionally, we'll show that Nomic Project (our dimensionality reduction algorithm) reveals other qualitative aspects of this dataset!
We've applied Principal Component Analysis (PCA) for dimensionality reduction. While it's computationally efficient - taking only ~2 seconds to compute on this dataset - the results are not ideal: the different digit classes overlap heavily, making it difficult to distinguish one class from another.
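For reference, a PCA reduction like this one can be sketched in a few lines with scikit-learn. This assumes the X and y arrays loaded above; exact timings will vary by machine, and the plotting choices are just one way to view the result.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the 784-dimensional vectors onto their top two principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Color each point by its digit label to see how much the classes overlap.
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y.astype(int), cmap="tab10", s=1)
plt.colorbar(label="digit")
plt.show()
```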
t-SNE is much better than PCA at preserving local structure in the data, meaning it keeps similar points close together while separating dissimilar points. Unlike PCA which is a linear projection, t-SNE can capture non-linear relationships in the data, allowing it to find more complex patterns and create more meaningful clusters.
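A comparable t-SNE reduction is available in scikit-learn. This is a rough sketch under the same assumptions as above: the parameter values are illustrative defaults rather than tuned settings, and on all 70,000 points the run can take many minutes.

```python
from sklearn.manifold import TSNE

# Non-linear reduction to 2D; perplexity controls the size of the local
# neighborhood each point tries to preserve.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=42)
X_tsne = tsne.fit_transform(X)

print(X_tsne.shape)  # (70000, 2)
```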
In this view, we're using UMAP for dimensionality reduction, which was developed by Leland McInnes, John Healy, and James Melville in 2018.
UMAP is faster than t-SNE, and while t-SNE excels at preserving local structure (keeping similar points together), UMAP is better at maintaining both local and global relationships. This means UMAP not only creates clear clusters but also better preserves the overall structure and relationships between different regions of a dataset.
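The equivalent UMAP reduction uses the umap-learn package. Again, this is a minimal sketch on the same X array, with n_neighbors and min_dist set to their common defaults rather than values tuned for this dataset.

```python
import umap  # pip install umap-learn

# n_neighbors balances local vs. global structure; min_dist controls how
# tightly points are packed together in the 2D layout.
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_umap = reducer.fit_transform(X)

print(X_umap.shape)  # (70000, 2)
```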
Finally, we're looking at the data through Nomic Project, our own dimensionality reduction algorithm. Like UMAP and t-SNE, it creates clear clusters while preserving meaningful structure, but with a key advantage: computational efficiency. The algorithm scales nearly linearly with data size, unlike t-SNE and UMAP, which scale at O(n log n). This efficiency lets us create high-quality data maps for our users even on extremely large datasets.
We believe dimensionality reduction is a fundamentally important building block: it's an area of fundamental research upstream of nearly everything in AI. Our research focuses on making it more accurate and scalable with Nomic Project, the dimensionality reduction algorithm that Atlas runs, which builds on the foundations laid by t-SNE and UMAP. We're particularly focused on improving how we approximate relationships in the reduced space, allowing us to efficiently process millions of data points while maintaining the integrity of their relationships.
Our development benefits from direct collaboration with the very pioneers who revolutionized this field. Our advisors include Dr. Laurens van der Maaten, co-creator of t-SNE, and Dr. Leland McInnes, co-creator of UMAP—bringing their deep expertise and insights to help guide our advances in dimensionality reduction.
Why are we improving dimensionality reduction at the algorithmic level? We want to create more accurate data maps that better reflect your actual data. We want to give more people the ability to work with data maps of larger datasets. And we want faster processing that spins up data maps more quickly for our users. In short, we want to make data mapping more capable and accessible for everyone!