How does cosine similarity work?

When working with LLM embeddings, it is often important to be able to compare them. Cosine similarity is the recommended way to do this.

Look up "how to compare vectors" and cosine similarity will be the most common (if not the only) approach you will see. I've been working with vectors a lot lately in the context of LLM embeddings, and being able to measure how similar any two embeddings are has become an important part of my workflow. But how does the cosine similarity process actually work?

I've been relying on copy/pasting cosine similarity code without really understanding how it works. To give myself a deeper understanding, I want to answer the following questions:

  1. How does the cosine similarity formula work?
  2. What do the different parts of it mean?
  3. Why is this a useful method for comparing LLM embeddings?

What do we mean by "vectors"?

Before getting too deep, it's worth clarifying what we mean by "vectors". For my projects I'm using the terms "embedding" and "vector" interchangeably. To quote a previous post of mine:

The general concept of "embeddings" is an offshoot of the Large Language Model (LLM) technology that makes tools like ChatGPT work. The basic idea is that you can take a piece of text (a blog post, for example) and turn it into a vector (an array of numbers). This vector is called an "embedding" and it represents the "meaning" of the text.

In short, embeddings are vectors and vectors are lists of numbers. If, like me, you translate all code into JavaScript, we're talking about arrays. const vector = [0.1, 0.2, 0.3, 0.4, 0.5]; is a vector.

The vectors created for LLM embeddings are very long arrays of numbers. An embedding created by OpenAI's ada-002 model contains 1,536 numbers. Mathematically that can be described as a vector in 1,536-dimensional space. 2D vectors (that is to say, arrays with only 2 numbers) are much less useful in the context of embeddings, but the same principles apply, and we can use them to illustrate how cosine similarity works.

Why is cosine similarity useful when comparing vectors?

Being able to compare how similar two vectors are is a key part of working with embeddings. Cosine similarity is the recommended way to do this.

If we picture a super-simplified 2D space where all our vectors have only two values, the angle between two vectors is the angle between the two lines they represent. The lines are drawn from the origin (0, 0) to the end of the vector, treating the two vector numbers as x/y co-ordinates.

Note: if we had vectors with three values, we'd be working in 3D space. Four values, 4D space, and so on. The principles of cosine similarity remain the same no matter how many dimensions you're working in.

[Figure: Simplified examples showing the angle θ between two vectors in 2D space.]

The θ (theta) value is the angle between the two vectors. This is the angle required to rotate one vector to align with the other. The cosine of that angle (cos(θ)) gives us the cosine similarity: a number between -1 and 1.

If the directions of the vectors are identical, the cosine similarity is 1. If they're orthogonal (at right angles), the cosine similarity is 0. If they're opposite, the cosine similarity is -1.
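To make those three cases concrete, here's a minimal sketch using 2D vectors. The `cosine` helper and the example vectors are my own illustrative names and values, chosen so the arithmetic comes out exactly; it's a demonstration rather than production code.

```javascript
// Illustrative helper: dot product divided by the product of the magnitudes.
const cosine = (a, b) => {
  const dot = a.reduce((acc, cur, i) => acc + cur * b[i], 0);
  const magA = Math.sqrt(a.reduce((acc, cur) => acc + cur ** 2, 0));
  const magB = Math.sqrt(b.reduce((acc, cur) => acc + cur ** 2, 0));
  return dot / (magA * magB);
};

const right = [1, 0]; // points along the x axis

console.log(cosine(right, [2, 0]));  // same direction: 1
console.log(cosine(right, [0, 3]));  // right angle: 0
console.log(cosine(right, [-4, 0])); // opposite direction: -1
```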

Cosine similarity ignores the magnitude of the vectors

The cosine similarity formula only cares about the angle between the vectors, not their length. This means that vectors of different lengths can still have a cosine similarity of 1 if they're pointing in the same direction.
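A quick way to convince yourself of this is to compare a vector with a scaled-up copy of itself. The sketch below assumes a hypothetical `cosine` helper and made-up example vectors; floating point rounding means the result may not be exactly 1, so I round it for display.

```javascript
// Illustrative helper: dot product divided by the product of the magnitudes.
const cosine = (a, b) => {
  const dot = a.reduce((acc, cur, i) => acc + cur * b[i], 0);
  const magA = Math.sqrt(a.reduce((acc, cur) => acc + cur ** 2, 0));
  const magB = Math.sqrt(b.reduce((acc, cur) => acc + cur ** 2, 0));
  return dot / (magA * magB);
};

const shortVector = [1, 2];
const longVector = [10, 20]; // same direction, ten times the length

const similarity = cosine(shortVector, longVector);
console.log(similarity.toFixed(6)); // 1.000000
```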

Digging into the cosine similarity formula

The mathematical formula itself (as cribbed from Wikipedia) looks like this:

$$\text{Cosine Similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \times \|\mathbf{B}\|}$$

where:

  • A⋅B is the dot product of vectors A and B.
  • ∣∣A∣∣ is the magnitude of vector A.
  • ∣∣B∣∣ is the magnitude of vector B.
  • θ is the angle between the two vectors.

That seems straightforward enough, but to translate it into a useable JavaScript function I'll need to answer a few follow-up questions.

What is the "dot product" of two vectors?

"Dot product" basically means "multiply and add". The dot product of two vectors is the sum of the products of their corresponding elements.

$$\mathbf{A} \cdot \mathbf{B} = a_1 b_1 + a_2 b_2 + \ldots + a_n b_n$$

For example, the dot product of the vectors [1, 2, 3] and [4, 5, 6] is 1 × 4 + 2 × 5 + 3 × 6 = 4 + 10 + 18 = 32. Given vectors a and b, the dot product can be calculated in JS like this:

const dotProduct = a.reduce((acc, cur, i) => acc + cur * b[i], 0);
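Wrapped up as a reusable function, that might look like the sketch below. The function name `dot` and the length check are my own additions for illustration: without the check, a shorter `b` would make `b[i]` undefined and the result `NaN`.

```javascript
// Dot product: multiply corresponding elements, then add up the results.
const dot = (a, b) => {
  // Assumption: we'd rather fail loudly than silently return NaN.
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same number of elements");
  }
  return a.reduce((acc, cur, i) => acc + cur * b[i], 0);
};

console.log(dot([1, 2, 3], [4, 5, 6])); // 32
```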

What is the "magnitude" of a vector?

In simple language, the "magnitude" of a vector is its length. In mathematical terms, it's the square root of the sum of the squares of its elements.

$$\|\mathbf{v}\| = \sqrt{v_1^2 + v_2^2 + \ldots + v_n^2}$$

For example, the magnitude of the vector [1, 2, 3] is √(1² + 2² + 3²) = √(1 + 4 + 9) = √14. Given a vector a, the magnitude can be calculated in JS like this:

const magnitude = Math.sqrt(a.reduce((acc, cur) => acc + cur ** 2, 0));
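As a standalone function with a couple of checks, this could be sketched as follows. The name `magnitudeOf` is just an illustrative choice; [3, 4] is the classic 3-4-5 right triangle, which makes the answer easy to verify by eye.

```javascript
// Magnitude: square root of the sum of the squared elements.
const magnitudeOf = (v) => Math.sqrt(v.reduce((acc, cur) => acc + cur ** 2, 0));

console.log(magnitudeOf([3, 4]));    // 5
console.log(magnitudeOf([1, 2, 3])); // the square root of 14, approximately 3.7417
```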

But aren't we ignoring the magnitude of the vectors?

Calculating the magnitudes of the vectors feels counterintuitive given we want to ignore it. But by including the product of the magnitudes the formula can normalize the dot product, ensuring the similarity measure is independent of the vectors' lengths.
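Here's a small sketch of what that normalization buys us. The helper names (`rawDot`, `lengthOf`, `toUnit`) and the vectors are assumptions of mine for the demo. The raw dot product changes when one vector is stretched, but once both vectors are scaled down to length 1, the dot product stays the same, and that value is the cosine similarity.

```javascript
const rawDot = (a, b) => a.reduce((acc, cur, i) => acc + cur * b[i], 0);
const lengthOf = (v) => Math.sqrt(rawDot(v, v));
// Scale a vector to length 1 while keeping its direction.
const toUnit = (v) => v.map((x) => x / lengthOf(v));

const a = [3, 4];
const b = [4, 3];
const bStretched = [40, 30]; // same direction as b, ten times longer

console.log(rawDot(a, b));          // 24
console.log(rawDot(a, bStretched)); // 240

const normalized = rawDot(toUnit(a), toUnit(b));
const normalizedStretched = rawDot(toUnit(a), toUnit(bStretched));
console.log(normalized.toFixed(4));          // 0.9600
console.log(normalizedStretched.toFixed(4)); // 0.9600
```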

The full Cosine Similarity function in JavaScript

Now that we've unpicked all the parts of the formula, we can put them together to create a JavaScript function that calculates the cosine similarity of any two vectors.

$$\cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \times \|\mathbf{B}\|}$$

export const cosineSimilarity = (a, b) => {
    // Dot product: multiply corresponding elements, then sum the results
    const dotProduct = a.reduce((acc, cur, i) => acc + cur * b[i], 0);

    // Magnitude: square root of the sum of the squared elements
    const magnitudeA = Math.sqrt(a.reduce((acc, cur) => acc + cur ** 2, 0));
    const magnitudeB = Math.sqrt(b.reduce((acc, cur) => acc + cur ** 2, 0));

    const magnitudeProduct = magnitudeA * magnitudeB;
    if (magnitudeProduct === 0) return 0; // Prevent division by zero

    // Dividing by the product of the magnitudes normalizes the dot product
    return dotProduct / magnitudeProduct;
};
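
One thing the function above doesn't guard against is vectors of different lengths: if `b` is shorter than `a`, `b[i]` is `undefined` for the extra elements and the result is `NaN`. If you'd rather get an error, a stricter variant might look like this sketch. The name `strictCosineSimilarity` and the decision to throw are my assumptions, not part of the original function.

```javascript
const strictCosineSimilarity = (a, b) => {
  // Assumption: comparing embeddings of different sizes is always a mistake.
  if (a.length !== b.length) {
    throw new Error(`Vector lengths differ: ${a.length} vs ${b.length}`);
  }

  const dotProduct = a.reduce((acc, cur, i) => acc + cur * b[i], 0);
  const magnitudeA = Math.sqrt(a.reduce((acc, cur) => acc + cur ** 2, 0));
  const magnitudeB = Math.sqrt(b.reduce((acc, cur) => acc + cur ** 2, 0));

  const magnitudeProduct = magnitudeA * magnitudeB;
  if (magnitudeProduct === 0) return 0; // Same zero-vector behaviour as before

  return dotProduct / magnitudeProduct;
};

try {
  strictCosineSimilarity([1, 2, 3], [1, 2]);
} catch (error) {
  console.log(error.message); // Vector lengths differ: 3 vs 2
}
```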

Why doesn't this function use Math.cos()?

Given how much we've been talking about "cosines", it's surprising that our cosineSimilarity function doesn't make use of JavaScript's built-in Math.cos() function. This is because our function directly computes the cosine of the angle without needing to determine the angle itself. Using Math.cos() would require first calculating the angle using an inverse trigonometric function like Math.acos(), which is unnecessary and computationally more expensive.
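
If you do want the angle itself, you can recover it from the similarity with `Math.acos()`. This sketch uses two made-up 2D vectors that sit 45 degrees apart, and shows that feeding the angle back into `Math.cos()` just returns the number we already had. The clamp to the range -1 to 1 is a precaution I've added, since floating point rounding can nudge the similarity slightly outside that range, which would make `Math.acos()` return `NaN`.

```javascript
const a = [1, 0];
const b = [1, 1];

const dotProduct = a.reduce((acc, cur, i) => acc + cur * b[i], 0);
const magnitudeA = Math.sqrt(a.reduce((acc, cur) => acc + cur ** 2, 0));
const magnitudeB = Math.sqrt(b.reduce((acc, cur) => acc + cur ** 2, 0));
const similarity = dotProduct / (magnitudeA * magnitudeB);

// Keep the value inside the range Math.acos() accepts.
const clamped = Math.min(1, Math.max(-1, similarity));
const angleRadians = Math.acos(clamped);
const angleDegrees = (angleRadians * 180) / Math.PI;

console.log(similarity.toFixed(4));             // 0.7071
console.log(angleDegrees.toFixed(2));           // 45.00
console.log(Math.cos(angleRadians).toFixed(4)); // 0.7071
```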


So why is cosine similarity useful for comparing LLM embeddings?

There's a reason cosine similarity is the recommended way to compare LLM embeddings. The important part of an embedding is its direction, not its length. If two embeddings are pointing in the same direction, then according to the model they represent the same "meaning".

Because cosine similarity measures the similarity of two vectors based on their direction, ignoring their length, it's the perfect tool for comparing embeddings. It's also computationally cheap, which is a bonus. And, as we've seen, it can be implemented in just a few lines of JavaScript.

The power of embeddings comes from their multidimensionality. The vectors are long, and the relationships between the numbers in the vectors are complex. But the principles of cosine similarity remain the same, no matter how many dimensions you're working in. It's a simple and effective way to compare vectors.
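
In practice, comparing embeddings usually means ranking a set of them against a query. The sketch below does that with tiny, made-up 3-number vectors standing in for real embeddings (which would have hundreds or thousands of numbers). The `posts` data, the `query` vector, and the `similarityTo` helper are all hypothetical.

```javascript
// Illustrative helper: cosine similarity of two equal-length vectors.
const similarityTo = (a, b) => {
  const dot = a.reduce((acc, cur, i) => acc + cur * b[i], 0);
  const magA = Math.sqrt(a.reduce((acc, cur) => acc + cur ** 2, 0));
  const magB = Math.sqrt(b.reduce((acc, cur) => acc + cur ** 2, 0));
  return dot / (magA * magB);
};

// Made-up "embeddings" for the demo; real ones come from a model.
const posts = [
  { title: "Post A", embedding: [0.9, 0.1, 0.0] },
  { title: "Post B", embedding: [0.1, 0.9, 0.2] },
  { title: "Post C", embedding: [0.7, 0.3, 0.1] },
];
const query = [1, 0.1, 0];

// Score every post against the query, then sort highest similarity first.
const ranked = posts
  .map((post) => ({ title: post.title, score: similarityTo(query, post.embedding) }))
  .sort((x, y) => y.score - x.score);

console.log(ranked.map((post) => post.title).join(", ")); // Post A, Post C, Post B
```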

Are there alternatives to cosine similarity?

Cosine similarity is the most common way to compare vectors, but it's not the only way. I'll be exploring alternatives in next month's post, so pop your email in the box below if you want to be notified when it's published.


