Author: Nomic Team

Data Mapping Series
Part 3: Dimensionality Reduction

Data Mapping Series

This is our Data Mapping Series, where we do a deep dive into the technologies and tools powering the unstructured data visualization and exploration capabilities of the Nomic Atlas platform. In this series you'll learn about how machine learning concepts like embeddings and dimensionality reduction blend with scalable web graphics to enable anyone to explore and work with massive datasets in their web browsers.

If you haven't read our previous posts on Data Maps, embeddings, and dimensionality reduction, we recommend starting there for some useful background!

Why Are Web Browsers The Best Data Browsers?

Throughout this series, we've demonstrated how data maps powered by embeddings and dimensionality reduction unlock powerful ways to explore massive datasets. But having great algorithms isn't enough - you also need an interface that makes this exploration intuitive and accessible.

In this post, we'll explain why we believe web browsers are really well suited for interactive data exploration, making powerful data mapping algorithms available not just to experienced developers, but to anyone who wants to understand their data.

Accessibility

The browser is the right way to reach people. Everyday users already have a browser on their computer, so they don't need to install additional software to take advantage of the built-in technology that makes data exploration a smooth and rich experience.

Speed

Modern web browsers are incredibly fast, in no small part thanks to years of work by companies like Google and Mozilla on turning JavaScript into a high-performance language.

The browser can also use your computer's GPU (graphics processing unit) without any extra setup. GPUs are great at running many calculations in parallel, which is much more efficient when you need to apply the same operation to every point in a large dataset.

Seeing Metadata Alongside Data Maps

Another reason, less about the underlying tech and more about the user experience, is that the web browser as an environment for rendering data comes with natural ways to get important metadata like images, TikToks, and Tweets (Xeets?) to show up alongside the data you are exploring.

Deepscatter: The Interactive Graphics Engine Powering Atlas In The Browser

To fully leverage these browser capabilities and overcome traditional visualization limitations, we are building Deepscatter, our custom web graphics engine. It's designed specifically to handle the challenges of visualizing millions of data points while taking advantage of modern browser technologies.

Traditional web visualization libraries struggle when dealing with massive datasets, often becoming sluggish or unresponsive. Deepscatter takes a different approach - it operates entirely client-side and loads only the data necessary for what you're currently viewing.

One of Deepscatter's key advantages comes from its use of Apache Arrow, which enables efficient memory management through contiguous memory blocks. Modern JavaScript's support for typed arrays means we can transfer data directly from tools like DuckDB, pandas, or Polars without costly serialization or deserialization steps.
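To make the point about contiguous memory concrete, here is a minimal sketch using only JavaScript's built-in typed arrays, not the Apache Arrow library itself. The two-column layout and the names `xColumn` and `yColumn` are assumptions for illustration and are not Deepscatter's actual schema.

```typescript
// One contiguous block of memory holding two float32 columns back to back.
// Hypothetical layout: [x0 .. x3][y0 .. y3]
const numPoints = 4;
const bytesPerFloat = Float32Array.BYTES_PER_ELEMENT; // 4 bytes
const buffer = new ArrayBuffer(2 * numPoints * bytesPerFloat);

// Typed-array views share the buffer's memory; creating a view copies nothing.
const xColumn = new Float32Array(buffer, 0, numPoints);
const yColumn = new Float32Array(buffer, numPoints * bytesPerFloat, numPoints);

xColumn.set([0.25, 0.5, 0.75, 1.0]);
yColumn.set([1.5, 2.5, 3.5, 4.5]);

// A view over the whole buffer sees the same bytes, with no parsing step.
const allValues = new Float32Array(buffer);
```

Because every view points at the same bytes, a column can be handed to another consumer as-is, which is the property that avoids a text-based serialize and parse round trip.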

All data in Deepscatter is transmitted using the Apache Arrow feather format, organized in a custom quadtree structure that enables selective loading based on the current zoom level. A quadtree recursively subdivides space into smaller and smaller regions in a hierarchical structure where each level provides increasingly fine-grained detail. This allows us to efficiently load only the appropriate level of detail needed for the current view, similar to how map applications like Google Maps show more detailed information as you zoom into specific areas.
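The zoom-dependent loading described above can be sketched as a walk over quadtree tiles that descends only into tiles overlapping the current viewport. This is a simplified illustration, not Deepscatter's implementation: the `depth/x/y` tile keys, the unit-square coordinate space, and the `tilesToLoad` function are all assumptions.

```typescript
interface Rect { x0: number; y0: number; x1: number; y1: number; }

// Bounds of a tile inside the unit square; each depth level halves the tile width.
function tileBounds(depth: number, ix: number, iy: number): Rect {
  const size = 1 / Math.pow(2, depth);
  return { x0: ix * size, y0: iy * size, x1: (ix + 1) * size, y1: (iy + 1) * size };
}

// True when the two rectangles overlap with positive area.
function intersects(a: Rect, b: Rect): boolean {
  return a.x0 < b.x1 && b.x0 < a.x1 && a.y0 < b.y1 && b.y0 < a.y1;
}

// Collect keys of all tiles, from the root down to maxDepth, that overlap the viewport.
function tilesToLoad(viewport: Rect, maxDepth: number): string[] {
  const keys: string[] = [];
  const visit = (depth: number, ix: number, iy: number): void => {
    if (!intersects(tileBounds(depth, ix, iy), viewport)) return; // prune the whole subtree
    keys.push(`${depth}/${ix}/${iy}`);
    if (depth === maxDepth) return;
    for (const dy of [0, 1]) {
      for (const dx of [0, 1]) {
        visit(depth + 1, 2 * ix + dx, 2 * iy + dy);
      }
    }
  };
  visit(0, 0, 0);
  return keys;
}

// Zoomed into a small region near one corner: only one tile per level is needed.
const zoomedIn = tilesToLoad({ x0: 0.1, y0: 0.1, x1: 0.2, y1: 0.2 }, 2);
```

In a real viewer the maximum depth would typically be derived from the zoom level, so zooming in both narrows the viewport and allows deeper tiles.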

Homer Simpson represented in quadtree format

Champagne tower showing hierarchical flow

The quadtree data structure, as seen in the rendering of Homer Simpson (image credit to Alex Wakeman), demonstrates how space can be recursively subdivided into smaller regions only where more detail is necessary. Larger squares are used for areas with less detail, while smaller squares provide higher resolution where needed, similar to how modern map applications adjust detail based on zoom level. You can think of it like a champagne tower, where each level represents a new level of granularity that data flows into only when necessary (image credit to Filaos).
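The champagne tower analogy can be sketched as a small quadtree in which each node keeps points until it is full and only then lets new points overflow into its four children. This is an illustrative sketch, not Deepscatter's code: `CAPACITY`, `MAX_DEPTH`, and the function names are assumptions.

```typescript
type Point = [number, number];

interface QuadNode {
  x: number; y: number; size: number; // square region starting at (x, y)
  depth: number;
  points: Point[];      // points held at this level
  children: QuadNode[]; // empty until this node overflows
}

const CAPACITY = 2;  // hypothetical number of points a node holds before overflowing
const MAX_DEPTH = 8; // stops endless splitting when many points coincide

function makeNode(x: number, y: number, size: number, depth: number): QuadNode {
  return { x, y, size, depth, points: [], children: [] };
}

function insert(node: QuadNode, p: Point): void {
  const half = node.size / 2;
  if (node.children.length === 0) {
    if (node.points.length < CAPACITY || node.depth === MAX_DEPTH) {
      node.points.push(p); // this level still has room
      return;
    }
    // The node is full: subdivide into four quadrants, only now that it is needed.
    node.children = [
      makeNode(node.x, node.y, half, node.depth + 1),
      makeNode(node.x + half, node.y, half, node.depth + 1),
      makeNode(node.x, node.y + half, half, node.depth + 1),
      makeNode(node.x + half, node.y + half, half, node.depth + 1),
    ];
  }
  // Overflow into the quadrant that contains the point.
  const index = (p[0] >= node.x + half ? 1 : 0) + (p[1] >= node.y + half ? 2 : 0);
  insert(node.children[index], p);
}

// Deepest level reached anywhere in the tree.
function treeDepth(node: QuadNode): number {
  return node.children.reduce((d, c) => Math.max(d, treeDepth(c)), node.depth);
}

const root = makeNode(0, 0, 1, 0);
const samples: Point[] = [[0.6, 0.2], [0.9, 0.9], [0.1, 0.1], [0.11, 0.11], [0.12, 0.12]];
samples.forEach((p) => insert(root, p));
```

With these sample points, the cluster near one corner forces that quadrant to split a second time, while the other quadrants stay as single large squares.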

Sending Computation To The GPU

Using typed Arrow arrays enables efficient communication with the GPU through optimized memory management. Deepscatter's rendering system leverages WebGL, with REGL handling efficient GPU buffer management. We've carefully optimized the rendering pipeline by minimizing the number of draw calls the CPU issues to the GPU, since each call adds overhead and many small calls become a significant performance bottleneck. By executing most visual transformations directly on the GPU, we achieve smooth, parallel processing for transitions and animations. Internally, we're looking at WebGPU as the future once Firefox and Safari offer native support for it.
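As a rough illustration of why fewer draw calls help, the sketch below packs every point's attributes into one typed array so that a whole batch can be submitted at once. It does not use WebGL or REGL: `MockGpu` is a stand-in that only counts calls, and the three-float vertex layout is an assumption rather than Deepscatter's actual format.

```typescript
// Hypothetical per-point attributes.
interface PointAttrs { x: number; y: number; size: number; }

const FLOATS_PER_POINT = 3; // x, y, size

// Interleave all points into one typed array: [x0, y0, s0, x1, y1, s1, ...].
function packPoints(points: PointAttrs[]): Float32Array {
  const packed = new Float32Array(points.length * FLOATS_PER_POINT);
  points.forEach((p, i) => {
    packed.set([p.x, p.y, p.size], i * FLOATS_PER_POINT);
  });
  return packed;
}

// Stand-in for a GPU API; it only records how often the CPU asks it to draw.
class MockGpu {
  drawCalls = 0;
  draw(vertexData: Float32Array, pointCount: number): void {
    if (vertexData.length !== pointCount * FLOATS_PER_POINT) {
      throw new Error("vertex data does not match point count");
    }
    this.drawCalls += 1;
  }
}

const points: PointAttrs[] = [];
for (let i = 0; i < 1000; i++) {
  points.push({ x: i, y: 2 * i, size: 1 });
}

// Naive approach: one call per point.
const naive = new MockGpu();
points.forEach((p) => naive.draw(packPoints([p]), 1));

// Batched approach: one call for the whole set.
const batched = new MockGpu();
batched.draw(packPoints(points), points.length);
```

Here the naive path issues 1000 calls and the batched path issues one, for the same data.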

Seeing Deepscatter In Action

We'll use the same data map from Part 1 and Part 2 of this blog series to show some of the capabilities that the Deepscatter engine enables when exploring data maps in Atlas.

This data map visualizes over 1 million Wikipedia biographies, with each point representing a person. The position of each point is determined by the semantic similarities between their biographies.

Seeing Searches In Context

When you search within Atlas, the results are highlighted while maintaining their context within the broader dataset. This makes it easy to understand not just what matches your search, but how those results relate to the rest of your data.

Smooth Interactive Experience

Thanks to modern browser capabilities and GPU acceleration, zooming in to explore specific regions of your data is seamless.

The same smooth performance applies when zooming back out to see the full context of your dataset.

Smooth Transitions Between Layouts

Atlas can seamlessly switch between different layouts of your data. Here, we've moved from the semantic layout to a geographic view based on birthplace coordinates.

These transitions are powered by point-to-point animations that run on the GPU, directly from your browser. Each biography smoothly moves from its semantic position to its geographic position.
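The arithmetic behind such a transition is a per-point interpolation between two positions. The sketch below runs it on the CPU for clarity; on the GPU, the same calculation would typically run for every point in parallel inside a vertex shader. The flat `[x0, y0, x1, y1, ...]` layout, the easing curve, and the function names are assumptions, not Deepscatter's actual shader code.

```typescript
// Ease-in, ease-out curve so points accelerate and then settle.
function smoothstep(t: number): number {
  const c = Math.min(1, Math.max(0, t));
  return c * c * (3 - 2 * c);
}

// Blend two layouts stored as flat coordinate arrays: [x0, y0, x1, y1, ...].
// t = 0 gives the starting layout, t = 1 gives the target layout.
function interpolateLayout(from: Float32Array, to: Float32Array, t: number): Float32Array {
  if (from.length !== to.length) {
    throw new Error("layouts must describe the same points");
  }
  const eased = smoothstep(t);
  const out = new Float32Array(from.length);
  for (let i = 0; i < from.length; i++) {
    out[i] = from[i] + (to[i] - from[i]) * eased;
  }
  return out;
}

// Two points with hypothetical semantic and geographic coordinates.
const semanticLayout = new Float32Array([0, 0, 1, 1]);
const geographicLayout = new Float32Array([10, 20, 3, 5]);
const halfway = interpolateLayout(semanticLayout, geographicLayout, 0.5);
```

Because each point's new position depends only on its own start and end coordinates, the work splits cleanly across GPU threads.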

Bridging Geographic And Semantic Spaces

This clip demonstrates one of the unique advantages that Deepscatter enables for data analysis using Atlas. Initially, the data is plotted geographically, and users can select specific areas with the mouse. The projection can then be switched to a semantic view, built from embeddings and dimensionality reduction as discussed in our previous posts. Because the selection carries over, users can see which semantic categories the geographically selected data falls into, in the context of the semantic categories our models running in Atlas discovered across the broader dataset!
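One way to picture a selection that survives a change of projection is to store it per row rather than per screen position. The sketch below is a simplified illustration under that assumption; the `Biography` fields, the box selection, and the function names are hypothetical and do not reflect the Atlas or Deepscatter APIs.

```typescript
interface Biography {
  lon: number; lat: number;             // position in the geographic layout
  semanticX: number; semanticY: number; // position in the semantic layout
  category: string;                     // hypothetical semantic category label
}

interface Box { minLon: number; maxLon: number; minLat: number; maxLat: number; }

// The selection is a per-row mask, so it does not depend on which layout is on screen.
function selectInBox(rows: Biography[], box: Box): Uint8Array {
  const mask = new Uint8Array(rows.length);
  rows.forEach((r, i) => {
    const inside =
      r.lon >= box.minLon && r.lon <= box.maxLon &&
      r.lat >= box.minLat && r.lat <= box.maxLat;
    mask[i] = inside ? 1 : 0;
  });
  return mask;
}

// After switching layouts, the same mask identifies the same rows,
// so their semantic categories can be summarized directly.
function categoryCounts(rows: Biography[], mask: Uint8Array): Record<string, number> {
  const counts: Record<string, number> = {};
  rows.forEach((r, i) => {
    if (mask[i] === 1) {
      counts[r.category] = (counts[r.category] || 0) + 1;
    }
  });
  return counts;
}

// Made-up rows for illustration only.
const rows: Biography[] = [
  { lon: 2.3, lat: 48.9, semanticX: 0.1, semanticY: 0.8, category: "artists" },
  { lon: 13.4, lat: 52.5, semanticX: 0.7, semanticY: 0.2, category: "scientists" },
  { lon: -74.0, lat: 40.7, semanticX: 0.1, semanticY: 0.7, category: "artists" },
  { lon: 12.5, lat: 41.9, semanticX: 0.2, semanticY: 0.8, category: "artists" },
];

const selection = selectInBox(rows, { minLon: -10, maxLon: 30, minLat: 35, maxLat: 60 });
const summary = categoryCounts(rows, selection);
```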

Data exploration should be as natural as web browsing itself. Modern web browsers provide near-universal accessibility and leverage advanced graphics computation for fast performance, making them ideal for interactive data visualization with tools like Deepscatter and Atlas.
