Alternatives to cosine similarity

Cosine similarity is the recommended way to compare vectors, but what other distance functions are there? And are any of them better?

Last month we looked at how cosine similarity works and how we can use it to calculate the "similarity" of two vectors. But why choose cosine similarity over any other distance function? Why not use Euclidean distance, or Manhattan, or Chebyshev?

In this article we'll dig into some alternative methods for comparing vectors, and see how they compare to cosine similarity. Are any of them faster? Are any of them more accurate?

Overview

Comparing vectors is a powerful tool when working with LLM embeddings. It's often the technical underpinning of RAG pipelines (Retrieval Augmented Generation), for instance, where related content is "found" and injected into the context of a message passed to an LLM. Across the web, cosine similarity is the most frequently recommended way to compare vectors.

OpenAI themselves say as much in their documentation:

Which distance function should I use?

We recommend cosine similarity. The choice of distance function typically doesn’t matter much.

But there are plenty of other distance functions, so why is cosine similarity the most recommended? And are there any other functions that might be better suited to comparing LLM embeddings?

What are we comparing?

Three pairs of simple vectors in 2D space: (0.5, 0.5) and (0.7, -0.3); (0.2, 0.4) and (-0.7, -0.2); (0.3, 0.8) and (0.4, -0.2). We'll use these to illustrate the different distance functions later in this article.

The end goal for all these similarity functions is to compare vectors. A vector is an array of numbers, and we refer to them as "vectors" because they can be conceptualized as points in space. A "two dimensional" vector (i.e. an array with two numbers) can be thought of as a point on a graph. A "three dimensional" vector (an array with three numbers) can be thought of as a point in 3D space. Their magnitude is the distance from the origin (0, 0 for 2D, or 0, 0, 0 for 3D) to the point.
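For instance, here's a quick sketch of how a vector's magnitude might be computed in JavaScript (the function name is purely illustrative):

// Magnitude = distance from the origin to the point the vector describes.
const magnitude = (vector) => Math.sqrt(
    vector.reduce((acc, curr) => acc + curr * curr, 0)
);

console.log(magnitude([0.5, 0.5])); // ≈ 0.7071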

In practice, I use similarity functions to compare LLM embeddings, which are vectors with ~1,500 dimensions. For the purposes of this article, we'll stick to visualizing the different functions in two dimensions, but all the concepts covered are applicable to multidimensional vectors.

Note: in the context of LLM embeddings, a vector is a list of numbers that represents the "meaning" of a piece of text. The more "similar" the vectors, the more similar the meanings of the text. For more detail, see my post about mapping LLM embeddings in three dimensions.


Cosine similarity

In case you missed the first article, here's a quick recap of how cosine similarity works:

\mathit{d}(A,B) = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{||\mathbf{A}|| \times ||\mathbf{B}||}
In this and following formulae, we'll use d(A,B) to represent the distance d between vectors A and B.

In cosine similarity, the final value is the cosine of the angle between two vectors. The similarity value is calculated by taking the dot product of the two vectors (A⋅B) and dividing it by the product of the magnitudes of the two vectors (∣∣A∣∣×∣∣B∣∣).

You can learn more about what "dot product" means (and how to calculate it) in the previous article.
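For reference, here's a minimal sketch of what the full calculation might look like in JavaScript (the function name and structure are illustrative, not taken from any library):

const cosineSimilarity = (a, b) => {
    let dot = 0;  // A · B
    let magA = 0; // will become ||A||²
    let magB = 0; // will become ||B||²
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }
    return dot / (Math.sqrt(magA) * Math.sqrt(magB));
};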

Cosine similarity uses the angle θ between two vectors in 2D space.

Vectors created by OpenAI embeddings are normalized to a magnitude of 1. The full cosine similarity function does its own normalization (by dividing by the magnitudes of the vectors), but the fact that the vectors are already normalized means we can simplify the function to just the dot product:

\text{Normalized: } \cos(\theta) = A \cdot B

As the OpenAI documentation says:

OpenAI embeddings are normalized to length 1, which means that:

  • Cosine similarity can be computed slightly faster using just a dot product
  • Cosine similarity and Euclidean distance will result in the identical rankings

In JavaScript, calculating the dot product can be a simple one-liner (if you think reduce() functions are simple):

const dotProduct = (a, b) => a.reduce((acc, cur, i) => acc + cur * b[i], 0);
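
As a quick sanity check with a pair of made-up, already-normalized 2D vectors, the dot product and the full cosine similarity sketch from earlier agree:

const a = [0.6, 0.8]; // magnitude 1
const b = [1, 0];     // magnitude 1

console.log(dotProduct(a, b));       // 0.6
console.log(cosineSimilarity(a, b)); // 0.6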

Euclidean distance

By comparison to cosine similarity, Euclidean distance is a much simpler concept to visualize. It's a measure of the straight-line distance between two points in space. It's calculated by taking the square root of the sum of the squared differences between the two vectors.

\mathit{d}(A,B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}
Finding Euclidean distance d for vectors A and B of length n.
Euclidean distance is the straight-line distance between two points.
const euclideanDistance = (a, b) => Math.sqrt(
    a.reduce((acc, curr, i) => {
        const diff = curr - b[i];
        return acc + diff * diff;
    }, 0)
);

Is Euclidean distance better than cosine similarity for use with LLM embeddings?

When the vectors are normalized to a magnitude of 1 (as OpenAI embeddings are), the results of cosine similarity and Euclidean distance will be equivalent. This means that the rankings of the vectors will be the same, even if the actual values are different.
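It's straightforward to see why: for unit-length vectors, the squared Euclidean distance is just a linear function of the dot product, so sorting by one is equivalent to sorting by the other (in reverse order, since a smaller distance means a higher similarity):

||A - B||^2 = ||A||^2 + ||B||^2 - 2(A \cdot B) = 2 - 2(A \cdot B)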

There's not much to separate the functions when viewed from the perspective of time complexity (i.e. using "Big O" notation), as both functions have O(n) complexity. But the dot product uses simpler operations (one multiplication and one addition per iteration) than the Euclidean calculation (which performs subtraction, squaring, and addition for each iteration and includes an additional square root operation at the end).

For measuring the similarity of LLM embeddings, cosine similarity is more often chosen because dot-product-based cosine similarity is slightly faster to calculate.


Manhattan distance

Unlike Euclidean distance, which measures the straight line distance between two points, Manhattan distance measures the sum of the absolute differences between the two vectors. It gets its name from the fact that it's the distance a taxi would have to travel to get from one point to another in a city grid (where travel is limited to moving along the grid lines, rather than the as-the-crow-flies Euclidean distance).

\mathit{d}(A,B) = \sum_{i=1}^{n} |A_i - B_i|
Finding Manhattan distance d for vectors A and B of length n.
Manhattan distance is the sum of the absolute differences between two points.
const manhattanDistance = (a, b) => a.reduce((acc, curr, i) => acc + Math.abs(curr - b[i]), 0);

Is Manhattan distance better than cosine similarity for use with LLM embeddings?

The numerical results of Manhattan distance will be different to cosine similarity, but (as with Euclidean distance) there should not be much variation in the rankings when sorting vectors by similarity. As with the other functions, the time complexity of Manhattan distance is also O(n). Its per-iteration operations are simpler than Euclidean distance (just one subtraction, one absolute value, and one addition) but a little more costly than the dot product calculation (with the Math.abs() call being the key differentiator).

When compared directly to Euclidean distance, Manhattan distance is often considered to be less affected by the "curse of dimensionality". This is a phenomenon where the distance between points in high-dimensional space becomes less meaningful as the number of dimensions increases. Manhattan distance is less affected by this because it doesn't square the differences between the points, which also makes it less sensitive to outliers (the squared differences in Euclidean distance can be heavily influenced by a single large difference).


Chebyshev distance

Chebyshev distance is the maximum of the absolute differences between the two vectors. While useful in certain scenarios, the fact that it discards the other differences between the vectors makes it less useful for comparing vectors in the context of LLM embeddings.

The Chebyshev distance is more useful for things like warehouse logistics, where it can measure the time an overhead crane takes to move an object (as the crane can move on the x and y axes at the same time but at the same speed along each axis).

\mathit{d}(A,B) = \max(|A_1 - B_1|, |A_2 - B_2|, \ldots, |A_n - B_n|)
Chebyshev distance
Chebyshev distance is the largest of the absolute differences between two points.
const chebyshevDistance = (a, b) => Math.max(...a.map((curr, i) => Math.abs(curr - b[i])));
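
To put some numbers on all of this, here's what the different measures give for the first example pair from the start of the article, (0.5, 0.5) and (0.7, -0.3), assuming the functions sketched above are in scope (values rounded):

const A = [0.5, 0.5];
const B = [0.7, -0.3];

console.log(cosineSimilarity(A, B));  // ≈ 0.3714
console.log(euclideanDistance(A, B)); // ≈ 0.8246
console.log(manhattanDistance(A, B)); // 1
console.log(chebyshevDistance(A, B)); // 0.8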

Other ways to measure similarity

There are a lot of distance functions out there. For this article, I've leaned into the "geometrical" view of vectors - i.e. treating them as points in multidimensional space. This is because LLM embeddings are built to represent the "meaning" of text, and the crucial part of that is the "directional" relationships between the vectors. Embeddings are all normalized and have no sense of "magnitude" (i.e. the length of the vector), so the "direction" is the only thing that matters.

Other geometric distance functions

  • Minkowski distance and Canberra distance are functions that build on the foundation of the Euclidean and Manhattan functions. Minkowski is a generalization of both that can generate either of those distances or something in between (see the sketch after this list), and Canberra is a weighted version of Manhattan distance.
  • Mahalanobis distance is a different process entirely, which measures the distance between a single point and a distribution. This is useful for intra-vector analysis, but not helpful for comparing vectors to one another.
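
As a rough sketch of how Minkowski distance generalizes the two (where p is the Minkowski order: p = 1 gives Manhattan, p = 2 gives Euclidean):

const minkowskiDistance = (a, b, p) => Math.pow(
    a.reduce((acc, curr, i) => acc + Math.pow(Math.abs(curr - b[i]), p), 0),
    1 / p
);

console.log(minkowskiDistance([0.5, 0.5], [0.7, -0.3], 1)); // 1 (Manhattan)
console.log(minkowskiDistance([0.5, 0.5], [0.7, -0.3], 2)); // ≈ 0.8246 (Euclidean)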

Correlation-based comparison functions

These measures focus on the relationship between variables rather than direct vector comparison. They can be useful for analyzing trends or patterns in data, but are less useful for comparing vectors.

  • Pearson correlation coefficient measures the linear relationship between two variables.
  • Spearman's rank correlation coefficient measures the strength and direction of the monotonic relationship between two ranked variables.
  • Kendall's tau is a measure of the ordinal association between two measured quantities.

Set-based comparison functions

These are more suitable for comparing sets or binary vectors, which may not be directly applicable to continuous-valued embeddings. Sets, unlike vectors, are unordered collections of unique elements.

  • Jaccard similarity measures the similarity between two sets.
  • Hamming distance measures the number of positions at which the corresponding elements are different.

Testing the functions

I put these functions to the test with some of my own data, and the results were generally in-line with what I expected.

I created an embedding vector for the titles of all my blog posts. I then created embeddings for each query in my search-test "ground truth dataset" and compared each query to every post title (98 queries each tested against 79 document titles, with each query and title represented by a 1,536-dimensional vector). I ran this test five times for each distance function, and averaged the results of each run.

The execution time for each comparison call (for example cosineSimilarity(vector1, vector2)) was measured with the JS performance.measure() method.
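
The timing harness looked roughly like this (a simplified sketch, not the exact test code; queryVector and titleVector are placeholder names):

performance.mark('compare-start');
cosineSimilarity(queryVector, titleVector);
performance.mark('compare-end');
performance.measure('compare', 'compare-start', 'compare-end');

// Read back the recorded duration (in milliseconds).
const [measurement] = performance.getEntriesByName('compare');
console.log(measurement.duration);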

The "accuracy" was measured using a combination of F-score, Average Precision (AP), and Normalized Discounted Cumulative Gain (NDCG). You can read more about how these measures work in my post, "How do you test the quality of search results?".

Performance results

Execution-time distribution for different distance functions (using JavaScript). Median execution times: Dot Product 0.07ms, Euclidean 0.08ms, Manhattan 0.10ms, Chebyshev 0.10ms, Cosine Similarity 0.12ms.

I was expecting a little variation in execution time for each comparison, but what I wasn't expecting was the bimodal nature of the results. For each function, there were two distinct groups of execution times. These peaks existed for all the function types at consistent relative spacings, which suggests that certain vector comparisons took longer than others no matter which distance function was being used.

The main takeaway, however, is as expected. Calculating the dot product was the fastest calculation, but not by a large margin. Euclidean distance was pretty close, followed by Manhattan distance and finally Chebyshev distance.

Accuracy results

Function              F-score    AP         NDCG       Combined
Cosine similarity     0.2366     0.7602     0.8067     0.6012
Dot Product           0.2366     0.7602     0.8067     0.6012
Euclidean distance    0.2366     0.7602     0.8067     0.6012
Manhattan distance    0.2364     0.7554     0.8041     0.5986
Chebyshev distance    0.2076     0.6297     0.6908     0.5093
The "combined" score is a simple average (mean) of the F-score, AP, and NDCG scores.

As predicted by the theory, the cosine similarity, dot product, and Euclidean distance functions all produced identical results. The Manhattan distance function was slightly less accurate, and the Chebyshev distance function was the least accurate. I was actually surprised to see Chebyshev score as highly as it did, and I suspect that as the size of the dataset increases the Chebyshev scores would drop off considerably.

Note: to ensure an apples-to-apples comparison, these tests measured how well each function ranked the entire list of document titles for each query - essentially just sorting the documents by similarity. This is why the F-scores in particular are quite low (by including all the documents in the ranking, the precision is diluted).

Which is the best choice for comparing LLM embeddings?

The crucial thing to keep in mind when choosing a distance or similarity function is this: what kind of difference do you care about? In the case of LLM embeddings, what we ultimately care about is "how similar are the meanings of these bits of text". For our embedding vectors, the "meaning" is represented by the direction of the vector, so the best comparison function would be one that focuses on the angle between the vectors and ignores other factors.

The "full" cosine similarity function is clearly the most accurate measure of this, given that it focusses exclusively on the directionality of the vectors it compares. This is why we see cosine similarity emerge as the most frequently recommended choice. And yet in practice Euclidean distance could potentially produce identical results with less complex (and therefor faster!) computation.

So is Euclidean distance what we should all be using? Again, we need to look at what we're measuring. For OpenAI embeddings, the vectors are all normalized. This means the similarity calculation can be simplified to just the dot product, and the dot product is marginally more efficient to compute than the full Euclidean distance.

TL;DR: For normalized vectors like those used by LLM embeddings, calculating the dot product is the optimal way to determine their similarity.


