How does cosine similarity work?
When working with LLM embeddings, it is often important to be able to compare them. Cosine similarity is the recommended way to do this.
Look up "how to compare vectors" and cosine similarity will be the most common (if not the only) approach you will see. I've been working with vectors a lot lately in the context of LLM embeddings, and being able to measure how similar any two embeddings are has become an important part of my workflow. But how does the cosine similarity process actually work?
I've been relying on copy/pasting cosine similarity code without really understanding how it works. To give myself a deeper understanding, I want to answer the following questions:
- How does the cosine similarity formula work?
- What do the different parts of it mean?
- Why is this a useful method for comparing LLM embeddings?
What do we mean by "vectors"?
Before getting too deep, it's worth clarifying what we mean by "vectors". For my projects I'm using the terms "embedding" and "vector" interchangeably. To quote a previous post of mine:
The general concept of "embeddings" is an offshoot of the Large Language Model (LLM) technology that makes tools like ChatGPT work. The basic idea is that you can take a piece of text (a blog post, for example) and turn it into a vector (an array of numbers). This vector is called an "embedding" and it represents the "meaning" of the text.
In short, embeddings are vectors and vectors are lists of numbers. If, like me, you translate all code into JavaScript, we're talking about arrays. For example, const vector = [0.1, 0.2, 0.3, 0.4, 0.5]; is a vector.
The vectors created for LLM embeddings are very long arrays of numbers. An embedding created by OpenAI's ada-002 model contains 1,536 numbers. Mathematically, that can be described as a vector in 1,536-dimensional space. 2D vectors (that is to say, arrays with only 2 numbers) are much less useful in the context of embeddings, but the same principles apply, and we can use them to illustrate how cosine similarity works.
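To make the difference in scale concrete, here's a minimal sketch (the values and names are placeholders, not output from a real model):
// Illustrative only: a 2D vector we can draw on paper, and a placeholder array
// standing in for a real 1,536-number embedding. "Dimensionality" is just array length.
const tinyVector = [0.4, 0.9];
const fakeEmbedding = new Array(1536).fill(0);
console.log(tinyVector.length, fakeEmbedding.length); // 2 1536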
Why is cosine similarity useful when comparing vectors?
Being able to compare how similar two vectors are is a key part of working with embeddings. Cosine similarity is the recommended way to do this.
If we picture a super-simplified 2D space where all our vectors have only two values, the angle between two vectors is the angle between the two lines they represent. The lines are drawn from the origin (0, 0) to the end of the vector, treating the two vector numbers as x/y co-ordinates.
Note: if we had vectors with three values, we'd be working in 3D space. Four values, 4D space, and so on. The principles of cosine similarity remain the same no matter how many dimensions you're working in.
The θ (theta) value is the angle between the two vectors. This is the angle required to rotate one vector to align with the other. The cosine of that angle (cos(θ)) gives us the cosine similarity: a number between -1 and 1.
If the directions of the vectors are identical, the cosine similarity is 1. If they're orthogonal (at right angles), the cosine similarity is 0. If they're opposite, the cosine similarity is -1.
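As a quick 2D-only illustration (a sketch, not the general function we'll build below), we can get each vector's angle from the x-axis with Math.atan2, subtract, and take the cosine of the result:
// 2D-only sketch: angle of each vector from the x-axis, take the difference, then the cosine.
const angleBetween2d = ([x1, y1], [x2, y2]) => Math.atan2(y2, x2) - Math.atan2(y1, x1);
console.log(Math.cos(angleBetween2d([1, 0], [2, 0])));  // 1  (same direction)
console.log(Math.cos(angleBetween2d([1, 0], [0, 3])));  // ~0 (orthogonal)
console.log(Math.cos(angleBetween2d([1, 0], [-1, 0]))); // -1 (opposite directions)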
Cosine similarity ignores the magnitude of the vectors
The cosine similarity formula only cares about the angle between the vectors, not their length. This means that vectors of different lengths can still have a cosine similarity of 1 if they're pointing in the same direction.
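For example, [1, 2] and [3, 6] point in exactly the same direction, but the second is three times as long. Using the formula covered below: their dot product is 1 × 3 + 2 × 6 = 15, their magnitudes are √5 and √45, and 15 / (√5 × √45) = 15 / 15 = 1.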
Digging into the cosine similarity formula
The mathematical formula itself (as cribbed from Wikipedia) looks like this:
similarity = cos(θ) = (A ⋅ B) / (∣∣A∣∣ × ∣∣B∣∣)
where:
- A⋅B is the dot product of vectors A and B.
- ∣∣A∣∣ is the magnitude of vector A.
- ∣∣B∣∣ is the magnitude of vector B.
- θ is the angle between the two vectors.
That seems straightforward enough, but to translate it into a usable JavaScript function I'll need to answer a few follow-up questions:
What is the "dot product" of two vectors?
"Dot product" basically means "multiply and add". The dot product of two vectors is the sum of the products of their corresponding elements.
For example, the dot product of the vectors [1, 2, 3] and [4, 5, 6] is 1 × 4 + 2 × 5 + 3 × 6 = 4 + 10 + 18 = 32. Given vectors a and b, the dot product can be calculated in JS like this:
const dotProduct = a.reduce((acc, cur, i) => acc + cur * b[i], 0);
What is the "magnitude" of a vector?
In simple language, the "magnitude" of a vector is its length. In mathematical terms, it's the square root of the sum of the squares of its elements.
For example, the magnitude of the vector [1, 2, 3] is √(1² + 2² + 3²) = √(1 + 4 + 9) = √14. Given a vector a, the magnitude can be calculated in JS like this:
const magnitude = Math.sqrt(a.reduce((acc, cur) => acc + cur ** 2, 0));
But aren't we ignoring the magnitude of the vectors?
Calculating the magnitudes of the vectors feels counterintuitive given that we want to ignore them. But dividing by the product of the magnitudes is what normalizes the dot product, ensuring the similarity measure is independent of the vectors' lengths.
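To see the whole formula in action with the example vectors from above: the dot product of [1, 2, 3] and [4, 5, 6] is 32, their magnitudes are √14 ≈ 3.742 and √77 ≈ 8.775, so the cosine similarity is 32 / (3.742 × 8.775) ≈ 0.975. In other words, the two vectors point in very similar directions despite their different lengths.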
The full Cosine Similarity function in JavaScript
Now that we've unpicked all the parts of the formula, we can put them together to create a JavaScript function that calculates the cosine similarity of any two vectors.
export const cosineSimilarity = (a, b) => {
const dotProduct = a.reduce((acc, cur, i) => acc + cur * b[i], 0);
const magnitudeA = Math.sqrt(a.reduce((acc, cur) => acc + cur ** 2, 0));
const magnitudeB = Math.sqrt(b.reduce((acc, cur) => acc + cur ** 2, 0));
const magnitudeProduct = magnitudeA * magnitudeB;
if (magnitudeProduct === 0) return 0; // Prevent division by zero
const similarity = dotProduct / magnitudeProduct;
return similarity;
};
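Calling it with the small example vectors from earlier (real embeddings would have 1,536 numbers each, but the function doesn't care how long the arrays are):
cosineSimilarity([1, 2, 3], [4, 5, 6]); // ≈ 0.9746
cosineSimilarity([1, 2], [3, 6]);       // ≈ 1 (same direction, different lengths)
cosineSimilarity([1, 0], [0, 1]);       // 0 (orthogonal)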
Why doesn't this function use Math.cos()?
Given how much we've been talking about "cosines", it's surprising that our cosineSimilarity function doesn't make use of JavaScript's built-in Math.cos() function. This is because our function directly computes the cosine of the angle without needing to determine the angle itself. Using Math.cos() would require first calculating the angle using an inverse trigonometric function like Math.acos(), which is unnecessary and computationally more expensive.
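If you ever did want the angle itself, you could recover it afterwards from the similarity value. The helper below is purely hypothetical, since comparing embeddings never needs the angle:
// Hypothetical helper: recover the angle (in radians) from the similarity value.
const angleBetween = (a, b) => Math.acos(cosineSimilarity(a, b));
angleBetween([1, 0], [0, 1]); // ≈ 1.5708 (π/2 radians, i.e. 90°)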
So why is cosine similarity useful for comparing LLM embeddings?
There's a reason cosine similarity is the recommended way to compare LLM embeddings. The important part of an embedding is its direction, not its length. If two embeddings are pointing in the same direction, then according to the model they represent the same "meaning".
Because cosine similarity measures the similarity of two vectors based on their direction, ignoring their length, it's the perfect tool for comparing embeddings. It's also computationally cheap, which is a bonus. And, as we've seen, it can be implemented in just a few lines of JavaScript.
The power of embeddings comes from their multidimensionality. The vectors are long, and the relationships between the numbers in the vectors are complex. But the principles of cosine similarity remain the same, no matter how many dimensions you're working in. It's a simple and effective way to compare vectors.
Are there alternatives to cosine similarity?
Cosine similarity is the most common way to compare vectors, but it's not the only way. I'll be exploring alternatives in next month's post, so pop your email in the box at the bottom of this page if you want to be notified when it's published.
Update: The follow-up post is now live, Alternatives to cosine similarity. Click through to learn about the mechanics of Euclidean, Manhattan, and Chebyshev distances, and how they compare to cosine similarity.
Related posts
If you enjoyed this article, RoboTom 2000™️ (an LLM-powered bot) thinks you might be interested in these related posts:
Mapping LLM embeddings in three dimensions
Visualising LLM embeddings in 3D space using SVG and principal component analysis.
Similarity score: 73% match.
TomBot2000: automatically finding related posts using LLMs
How I used LLM embeddings to find related posts for my statically-generated blog and then used GPT4 to explain why they're similar.
Similarity score: 67% match.