Alternatives to cosine similarity
Cosine similarity is the recommended way to compare vectors, but what other distance functions are there? And are any of them better?
Last month we looked at how cosine similarity works and how we can use it to calculate the "similarity" of two vectors. But why choose cosine similarity over any other distance function? Why not use Euclidean distance, or Manhattan, or Chebyshev?
In this article we'll dig in to some alternative methods for comparing vectors, and see how they compare to cosine similarity. Are any of them faster? Are any of them more accurate?
Overview
Comparing vectors is a powerful tool when working with LLM embeddings. It's often the technical underpinning of RAG pipelines (Retrieval Augmented Generation), for instance, where related content is "found" and injected into the context of a message passed to an LLM. Across the web, cosine similarity is the most frequently recommended way to compare vectors.
OpenAI themselves say as much in their documentation:
Which distance function should I use?
We recommend cosine similarity. The choice of distance function typically doesn’t matter much.
But there are plenty of other distance functions, so why is cosine similarity the most recommended? And are there any other functions that might be better suited to comparing LLM embeddings?
What are we comparing?
The end goal for all these similarity functions is to compare vectors. A vector is an array of numbers, and we refer to them as "vectors" because they can be conceptualized as points in space. A "two dimensional" vector (i.e. an array with two numbers) can be thought of as a point on a graph. A "three dimensional" vector (an array with three numbers) can be thought of as a point in 3D space. Their magnitude is the distance from the origin (0, 0
for 2D, or 0, 0, 0
for 3D) to the point.
In practice, I use similarity functions to compare LLM embeddings which are vectors with ~1,5k dimensions. For the purposes of this article, we'll stick to visualizing the different functions in two dimensions, but all the concepts covered are applicable to multidimensional vectors.
Note: in the context of LLM embeddings, a vector is a list of numbers that represents the "meaning" of a piece of text. The more "similar" the vectors, the more similar the meanings of the text. For more detail, see my post about mapping LLM embeddings in three dimensions.
Cosine similarity
In case you missed the first article, here's a quick recap of how cosine similarity works:
In cosine similarity, the final value is the cosine of the angle between two vectors. The similarity value is calculated by taking the dot product of the two vectors (A⋅B) and dividing it by the product of the magnitudes of the two vectors (∣∣A∣∣×∣∣B∣∣).
You can learn more about what "dot product" means (and how to calculate it) in the previous article.
Vectors created by OpenAI embeddings are normalized to a magnitude of 1. The full cosine similarity function does it's own normalization (by dividing by the magnitudes of the vectors), but the fact that the vectors are already normalized means we can simplify the function to just the dot product:
As the OpenAI documentation says:
OpenAI embeddings are normalized to length 1, which means that:
- Cosine similarity can be computed slightly faster using just a dot product
- Cosine similarity and Euclidean distance will result in the identical rankings
In JavaScript, calculating the dot product can be a simple one-liner (if you think reduce()
functions are simple):
const dotProduct = (a, b) => a.reduce((acc, cur, i) => acc + cur * b[i], 0);
Euclidean distance
By comparison to cosine similarity, Euclidean distance is a much simpler concept to visualize. It's a measure of the straight-line distance between two points in space. It's calculated by taking the square root of the sum of the squared differences between the two vectors.
const euclideanDistance = (a, b) => Math.sqrt(
a.reduce((acc, curr, i) => {
const diff = curr - b[i];
return acc + diff * diff;
}, 0)
);
Is Euclidean distance better than cosine similarity for use with LLM embeddings?
When the vectors are normalized to a magnitude of 1 (as OpenAI embeddings are), the results of cosine similarity and Euclidean distance will be equivalent. This means that the rankings of the vectors will be the same, even if the actual values are different.
There's not much to separate the functions when viewed from the perspective of time complexity (i.e. using "Big O" notation), as both functions have O(n) complexity. But the dot product uses simpler operations (one multiplication and one addition per iteration) than the Euclidean calculation (which performs subtraction, squaring, and addition for each iteration and includes an additional square root operation at the end).
For measuring the similarity of LLM embeddings, cosine similarity is more often chosen because dot-product-based cosine similarity is slightly faster to calculate.
Manhattan distance
Unlike Euclidean distance, which measures the straight line distance between two points, Manhattan distance measures the sum of the absolute differences between the two vectors. It gets its name from the fact that it's the distance a taxi would have to travel to get from one point to another in a city grid (where travel is limited to moving along the grid lines, rather than the as-the-crow-flies Euclidean distance).
const manhattanDistance = (a, b) => a.reduce((acc, curr, i) => acc + Math.abs(curr - b[i]), 0);
Is Manhattan distance better than cosine similarity for use with LLM embeddings?
The numerical results of Manhattan distance will be different to cosine similarity, but (as with Euclidean distance) there should not be much variation in the rankings when sorting vectors by similarity. As with the other two functions, the time complexity of Manhattan distance is also O(n), but the operations are simpler than Euclidean distance (just one subtraction and one addition per iteration) but a little more costly than the dot product calculation (with the "absolute" Math.abs()
calculation being the key differentiator).
When compared directly to Euclidean distance, Manhattan distance is often considered to be less affected by the "curse of dimensionality". This is a phenomenon where the distance between points in high-dimensional space becomes less meaningful as the number of dimensions increases. Manhattan distance is less affected by this because it doesn't square the differences between the points, which also makes it less sensitive to outliers (the squared differences in Euclidean distance can be heavily influenced by a single large difference).
Chebyshev distance
Chebyshev distance is the maximum of the absolute differences between the two vectors. While useful in certain scenarios, the fact that it discards the other differences between the vectors makes it less useful for comparing vectors in the context of LLM embeddings.
The Chebyshev distance is more useful for things like warehouse logistics, where it can measure the time an overhead crane takes to move an object (as the crane can move on the x and y axes at the same time but at the same speed along each axis).
const chebyshevDistance = (a, b) => 1 - Math.max(...a.map((curr, i) => Math.abs(curr - b[i])));
Other ways to measure similarity
There are a lot of distance functions out there. For this article, I've leaned into the "geometrical" view of vectors - i.e. treating them as points in multidimensional space. This is because LLM embeddings are built to represent the "meaning" of text, and the crucial part of that is the "directional" relationships between the vectors. Embeddings are all normalized and have no sense of "magnitude" (i.e. the length of the vector), so the "direction" is the only thing that matters.
Other geometric distance functions
- Minkowski distance and Canberra distance are functions that build on the foundation of the Euclidean and Manhattan functions. Minkowski is a generalization of both that can generate either of those distances or something in between, and Canberra is a weighted version of Manhattan distance.
- Mahalanobis distance is a different process entirely, which measures the distance between a single point and a distribution. This is useful for intra-vector analysis, but not helpful for comparing vectors to one another.
Correlation-based comparison functions
These measures focus on the relationship between variables rather than direct vector comparison. They can be useful for analyzing trends or patterns in data, but are less useful for comparing vectors.
- Pearson correlation coefficient measures the linear relationship between two variables.
- Spearman's rank correlation coefficient measures the strength and direction of the relationship between two variables.
- Kendall's tau is a measure of the ordinal association between two measured quantities.
Set-based comparison functions
These are more suitable for comparing sets or binary vectors, which may not be directly applicable to continuous-valued embeddings. Sets, unlike vectors, are unordered collections of unique elements.
- Jaccard similarity measures the similarity between two sets.
- Hamming distance measures the number of positions at which the corresponding elements are different.
Testing the functions
I put these functions to the test with some of my own data, and the results were generally in-line with what I expected.
I created an embedding vector for the titles of all my blog posts. I then created embeddings for each query in my search-test "ground truth dataset" and compared each query to every post title (98 queries each tested against 79 document titles, with each query and title represented by a 1,536-dimensional vector). I ran this test five times for each distance function, and averaged the results of each run.
The execution time for each comparison call (for example cosinesimilarity(vector1, vector2)
) was measured with the JS performance.measure()
method.
The "accuracy" was measured using a combination of F-score, Average Precision (AP), and Normalized Discounted Cumulative Gain (NDCG). You can read more about how these measures work in my post, "How do you test the quality of search results?".
Performance results
I was expecting a little variation in execution time for each comparison, but what I wasn't expecting was the bimodal nature of the results. For each function, there were two distinct groups of execution times. These peaks existed for all the function types at consistent relative spacings, which suggests that certain vector comparisons took longer than others no matter which distance function was being used.
The main takeaway, however, is as expected. Calculating the dot product was the fastest calculation, but not my a large margin. Euclidean distance was pretty close, followed by Manhattan distance and finally Chebyshev distance.
Accuracy results
Function | F-score | AP | NDCG | Combined |
---|---|---|---|---|
Cosine similarity | 0.2366 | 0.7602 | 0.8067 | 0.6012 |
Dot Product | 0.2366 | 0.7602 | 0.8067 | 0.6012 |
Euclidean distance | 0.2366 | 0.7602 | 0.8067 | 0.6012 |
Manhattan distance | 0.2364 | 0.7554 | 0.8041 | 0.5986 |
Chebyshev distance | 0.2076 | 0.6297 | 0.6908 | 0.5093 |
As predicted by the theory, the cosine similarity, dot product, and Euclidean distance functions all produced identical results. The Manhattan distance function was slightly less accurate, and the Chebyshev distance function was the least accurate. I was actually surprised to see Chebyshev score as highly as it did, and I suspect that as the size of the dataset increases the Chebyshev scores would drop off considerably.
Note: to ensure an apples-to-apples comparison, these tests measured how well each function ranked the entire list of document titles for each query - essentially just sorting the documents by similarity. This is why the F-scores in particular are quite low (by including all the documents in the ranking, the precision is diluted).
Which is the best choice for comparing LLM embeddings?
The crucial thing to keep in mind when choosing a distance or similarity function is this: what kind of difference do you care about? In the case of LLM embeddings, what we ultimately care about is "how similar are the meanings of these bits of text". For our embedding vectors, the "meaning" is represented by the direction of the vector, so the best comparison function would be one that focuses on the angle between the vectors and ignores other factors.
The "full" cosine similarity function is clearly the most accurate measure of this, given that it focusses exclusively on the directionality of the vectors it compares. This is why we see cosine similarity emerge as the most frequently recommended choice. And yet in practice Euclidean distance could potentially produce identical results with less complex (and therefor faster!) computation.
So is Euclidean distance what we should all be using? Again, we need to look at what we're measuring. For OpenAI embeddings, the vectors are all normalized. This means the similarity calculation can be simplified to just the dot product, and the dot product is marginally more efficient to compute than the full Euclidean distance.
TL;RD: For normalized vectors like those used by LLM embeddings, calculating the dot product is the optimal way to determine their similarity.
Related posts
If you enjoyed this article, RoboTom 2000™️ (an LLM-powered bot) thinks you might be interested in these related posts:
How does cosine similarity work?
When working with LLM embeddings, it is often important to be able to compare them. Cosine similarity is the recommended way to do this.
Similarity score: 82% match . RoboTom says:
Mapping LLM embeddings in three dimensions
Visualising LLM embeddings in 3D space using SVG and principle component analysis.
Similarity score: 66% match . RoboTom says: