Mapping LLM embeddings in three dimensions

Visualising LLM embeddings in 3D space using SVG and principal component analysis.

Embeddings are long lists of seemingly inscrutable numbers created by an LLM, but if you're prepared to lose a little precision you can force them into 3D space where they're a little easier to comprehend and explain.

fig #1: Headlines from a galaxy far, far away.

Plotting the semantic "meaning" of strings of text that have been converted to embeddings by an LLM. Hover over points to see the original text value and use the "rotation" slider to create a parallax effect.

What am I trying to show here?

I talked a lot about "embeddings" in a recent article about finding related posts within my blog archive. In that article I showed how embeddings could be used to calculate similarities between different chunks of text, and while writing it I had to work out how best to describe what embeddings actually are.

The most obvious description I found for embeddings was the one most closely tied to how I was using them; I was using them to calculate how "close" two pieces of text were to each other, so that naturally led to relying on physical space as a useful metaphor. Sure, the embeddings I use are coordinates in fifteen-hundred-dimensional space, but if you squint your eyes just so you can shrug and say "coordinates are coordinates". As I said in that article:

"by using embeddings you're creating a 'map' of the meaning of your content"

Me, a few weeks ago

Over the past few months I've found this to be a really helpful way to think about embeddings, and I've been describing this theoretical "map" whenever I've mansplained embeddings to people (apologies if you've met me lately and been bored stiff by my constant LLM talk).

How can we actually visualise this map?

It's all well and good to describe a theoretical multi-dimensional "map of content", but quite another to come up with a way to visualise it without access to a tesseract or any other n-dimensional hypercube. As I mentioned in the previous article, I've been using OpenAI's ada-002 model to create my embeddings, and the embeddings it creates are "vectors" made up of 1,536 numbers. To plot these vectors on a graph you'd need to (somehow) visualise 1,536-dimensional space. That's an impossibility, so the only practical option is to reduce the number of dimensions we're dealing with. Thankfully there is a mathematical tool we can put to use to achieve this, provided we're happy to lose some data in the process.

Principal component analysis (PCA) is a "dimensionality reduction" technique that allows us to take a high-dimensional dataset and convert it into a smaller set of dimensions. This is a new technique to me (and I copy/pasted most of the code I used) but to the best of my understanding a "principal component" can be seen as a "direction" in multidimensional space. In a 3D space, the directions are up/down, left/right, forward/backward (the classic x, y, z coordinates). In 4D, 5D, n-D space, the same idea applies. PCA works out which directions have the most variance and ignores the ones with less variance.

It's an inherently lossy process as we're literally ignoring parts of the data, but it's a good way to get the general "vibe" of a dataset. I'm happy listening to MP3s and looking at JPEGs online as the compression benefits outweigh the loss in fidelity. And the embeddings I've been working with represent such a small dataset that I'm happy to lose some detail in exchange for the added comprehensibility.

The practical implementation of PCA in my project was actually fairly trivial thanks to the ml-pca Node.js package, so now I can take my list of embeddings (with each embedding being an array of 1,536 numbers) and transform them into arrays of just three numbers.

import { PCA } from "ml-pca";
import { embeddings } from "./my-local-data.js";

const pca = new PCA(embeddings);

// Get the first three principal components
const { data } = pca.predict(embeddings, { nComponents: 3 });

I did originally include an example of a full embedding here, but it was just too much scrolling even if it was funny to see the full before/after comparison. Suffice it to say, this process turns a big array into a small array.

What does this tell me about my data?

Given that I'd already created an embedding for each of my blog posts, the next natural step was to run a PCA on them and plot them in a 3D scatter graph. The labels on the graph show just the post titles, but the embeddings were calculated from the entire content of each post. The primary category each post belongs to is manually set by me when I write each post, and the colour of each point on the graph corresponds to a specific category.

fig #2: A "map" of my blog posts

Each post's position in space is determined by the embedding of the content (a.k.a. they're arranged by semantic meaning).

This graph is pretty, but currently just serves as a novelty. It was a lot of fun to make, as I've done a lot of 2D SVG graphing but never anything that tried to show three dimensions like this (and I'm rather pleased with how it turned out). I find it fascinating to see "objective" relationships between content I've written over the last ten years, especially the little clusters of topics...
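For the curious, the core of that "fake 3D" trick is surprisingly small. This is a minimal sketch of the idea rather than my actual graph code: rotate each point around the vertical axis by the slider's angle, then scale by depth so nearer points appear larger:

```javascript
// Sketch: project a 3D point onto a 2D SVG plane with a rotation angle
// around the vertical (y) axis, creating the parallax effect.
function project([x, y, z], angleDegrees) {
  const a = (angleDegrees * Math.PI) / 180;

  // Rotate the point around the y axis...
  const rx = x * Math.cos(a) + z * Math.sin(a);
  const rz = -x * Math.sin(a) + z * Math.cos(a);

  // ...then apply a simple perspective divide so depth affects size.
  // The 0.1 factor controls how strong the perspective effect is.
  const perspective = 1 / (1 + rz * 0.1);
  return { cx: rx * perspective, cy: y * perspective, scale: perspective };
}

console.log(project([1, 0, 0], 0));  // { cx: 1, cy: 0, scale: 1 }
console.log(project([1, 0, 0], 90)); // cx ≈ 0: the point has swung to the z axis
```

Re-rendering every point's `cx`/`cy` (and radius via `scale`) as the slider moves is enough to sell the illusion of depth.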

There are a few massive outliers from when I spent a month writing about sports prediction. There's a pleasing "podcasting zone" over to the right and a nice little typography corner down at the bottom, too. While not intrinsically useful yet, it's already got me rethinking how I categorise my archive, and it's also got me thinking about clusters I'd like to flesh out more (and gaps that might need filling!).

Importantly, what it does do is show visually how my cosine similarity calculations were working out which posts were similar. A lot of the dimensions have been removed, but the basic premise is still clearly shown. This graph is a much more concrete illustration of "embedding space" than my previous hand-wavey, back-of-a-napkin attempts ever were. It also has the benefit of being based on real data (even if it is greatly compressed).
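For reference, the cosine-similarity calculation that the full 1,536-dimension embeddings feed into looks like this (a generic sketch, not lifted from my codebase; it works identically on 3D points or 1,536-dimension vectors):

```javascript
// Cosine similarity: the cosine of the angle between two vectors.
// 1 means "pointing the same way", 0 means "unrelated directions".
function cosineSimilarity(a, b) {
  let dot = 0;
  let magA = 0;
  let magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] ** 2;
    magB += b[i] ** 2;
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

console.log(cosineSimilarity([1, 2, 3], [2, 4, 6])); // ≈ 1 (same direction)
console.log(cosineSimilarity([1, 0], [0, 1]));       // 0 (orthogonal)
```

Because it compares angles rather than distances, two posts "pointing the same way" in embedding space score highly even if one vector is longer than the other.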

When I'm describing embeddings as a "map of content" I can now show people this graph to make my point more clearly than my bumbling words ever could.

This is not a true "map" of embedding space

In some ways this is not the post I set out to write.

When I was first experimenting with PCA, I assumed that the outputs represented absolute values. I thought that an embedding for a string of text would always result in the same 3D coordinates and that PCA would "simplify" all embedding data in the same way. I naively thought that PCA would allow me to map any embedding into the same 3D space. I was wrong about this.

Because the principal components are determined by the variance in the data, different datasets will have more or less variance in different dimensions. The PCA "space" that gets created is unique to the set of embeddings the PCA was run with. The "map" of my blog posts will be drawn in a very different space to the map of your blog posts, so we couldn't combine the two sets of 3D coordinates. And if we combined our embeddings and ran a single PCA across all of them, we'd get a map of our combined posts, but the positions wouldn't match either of the original independent maps.

My original plan was to build the graph framework and have a form input where anyone could type in some text and see it added into the embedding space. I thought I could make a real map of all embeddings. PCA just doesn't work like that. I could (and might yet?!) create a page where you can input a whole set of text strings and have it calculate the embeddings and PCA'd 3D map in real-time, but what with building the SVG cube graph from scratch I've already burned more time on this post than I intended. That'll have to wait for a future project.
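One note for that future project: a fitted PCA model can project new points into its own space, which (as I understand it, and roughly what ml-pca's predict() does) boils down to multiplying the new embedding against the stored component matrix. Here's a dependency-free sketch with a hypothetical 4D-to-2D projection; a real implementation would also subtract the fitted mean from the new point first:

```javascript
// Project a new embedding into an already-fitted PCA space by taking its
// dot product with each stored principal component.
function projectIntoSpace(embedding, components) {
  // components: one row per principal component, same length as the embedding.
  return components.map((component) =>
    component.reduce((sum, weight, i) => sum + weight * embedding[i], 0)
  );
}

// Pretend components for illustration: PC1 is just the first raw dimension,
// PC2 is just the third (real components mix all dimensions with weights).
const components = [
  [1, 0, 0, 0],
  [0, 0, 1, 0],
];

console.log(projectIntoSpace([0.5, 9, -2, 9], components)); // [ 0.5, -2 ]
```

The catch, as described above, is that the components themselves only mean anything relative to the dataset the PCA was originally fitted on.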

What next?

Aside from building the interactive version of the graph, what are the next steps for this? Crucially, how can I turn this curiosity into something genuinely useful?

Fancy data vis

I love a good piece of data visualisation, and embeddings open up a host of new areas for me to experiment with. And if I use embeddings and PCA to generate "spatial" graphs and charts, it's a small step to pair that with other factors to tell interesting stories with data.

Search index size optimisation

Following on from my "related posts" experiments, I've been testing out embeddings as the basis for a genuine site-search engine. It "kind of" works currently, but the biggest issue is that embeddings are so damn large! My current set of "whole file" embeddings comes in at about 1.6 MB, which is more than three times the size of my real search index (which, at 419 KB, contains all the text content of all my pages). Yep, the JSON file with a single embedding for each post is larger than the actual content of all those posts. And to get search results that compete with, say, Fuse.js' matching engine I'd need even more embeddings; ideally one per paragraph of text. Totally unusable for the statically-generated, client-side-only approach I prefer.

Applying principal component analysis to those embeddings might (maybe) change that. I doubt three dimensions would produce good search results, but perhaps even halving the number of dimensions might still produce usable results while being a more manageable size. I'll definitely be doing more experimentation around this in the coming weeks.
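As a rough back-of-envelope check of the potential saving, here's a sketch comparing the JSON size of a full 1,536-number embedding with a 256-dimension version. The numbers are random stand-ins (and the truncation is a stand-in for a real PCA reduction), and 256 is an arbitrary target rather than a tested sweet spot:

```javascript
// Compare the JSON-encoded size of a full embedding vs a reduced one.
const fullDims = 1536;
const reducedDims = 256;

// Fake embedding: random values rounded to 6 decimal places, which is
// roughly the precision you'd see in real embedding JSON.
const fakeEmbedding = Array.from({ length: fullDims }, () =>
  Number((Math.random() * 2 - 1).toFixed(6))
);

// Stand-in for PCA output: just take the first 256 numbers.
const reduced = fakeEmbedding.slice(0, reducedDims);

const fullBytes = JSON.stringify(fakeEmbedding).length;
const reducedBytes = JSON.stringify(reduced).length;
console.log(fullBytes, reducedBytes, (reducedBytes / fullBytes).toFixed(2));
```

The byte size scales almost linearly with the dimension count, so a 256-dimension index would land somewhere around a sixth of the current 1.6 MB; whether the search results survive that much squashing is the open question.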

Wherever I go with this next, I'm having a lot of fun and I'm definitely not finished yet.

Related posts

If you enjoyed this article, RoboTom 2000™️ (an LLM-powered bot) thinks you might be interested in these related posts:

TomBot2000: automatically finding related posts using LLMs

How I used LLM embeddings to find related posts for my statically-generated blog and then used GPT4 to explain why they're similar.

Similarity score: 74% match.

Adding client side search to a static site

Creating a site-search function that doesn't rely on external services

Similarity score: 67% match.
