How do you test the quality of search results?

Improving or comparing search algorithms requires being able to objectively test their performance. How on earth do you actually do that?

The site-search for this blog needs improving, but I have no idea how to measure how "good" or "bad" it is. If I can't measure it, then how can I tell if any future changes improve it? With this in mind I've finally done some research and put together a handy toolset for assessing the quality of my site's search functionality. Strap in, this is a deep dive!


Part 1: learning the basic concepts

It turns out that "Information Retrieval" is a rich field of academic study, and comes with a wealth of approaches for evaluating results. I'm not inventing anything new here, just documenting my own mini voyage of discovery.

Precision, Recall, and the F-score

When assessing the quality of search results, the two key factors to consider are precision and recall.

  • Precision is the proportion of retrieved documents that are relevant.
  • Recall is the proportion of relevant documents that are retrieved.

For my purposes precision and recall are equally important, so it's useful to combine these metrics into a single measure: the F-score. This is a score between 0 and 1, with 1 meaning "the results are perfect" and 0 meaning "the results are terrible".

\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

The F-score (or, more accurately, the F1-score1) uses the harmonic mean rather than a simple mathematical average. This gives a balanced measure that penalises extreme values, so both precision and recall need to be high to achieve a high F1-score.
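To make that concrete with some made-up numbers: if a query returns 8 documents, 6 of which are relevant, and the ground truth lists 10 relevant documents in total, then precision is 6/8 = 0.75, recall is 6/10 = 0.6, and:

\text{F1 Score} = 2 \times \frac{0.75 \times 0.6}{0.75 + 0.6} = \frac{0.9}{1.35} \approx 0.67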

Average Precision

An F-score shows the trade-off between precision and recall, but gives no consideration to the ranking of the results. And in the real world the position of a relevant document in the results list is kind of important (just ask anyone whose website appears on the second page of Google).

An assessment measure that does account for ranking position is Average Precision (AP). Similar to the F-score, AP measures the quality of search results on a 0–1 scale. Unlike the F-score, however, AP considers the order in which relevant documents appear, giving more weight to relevant documents that appear earlier.

Average Precision is calculated by averaging the precision at each position where a relevant document is retrieved.

\text{AP} = \frac{1}{R} \sum_{k=1}^{n} P(k) \cdot \text{rel}(k)

where:

  • 𝑅 is the total number of relevant documents for the query.
  • 𝑛 is the total number of retrieved documents.
  • 𝑃(𝑘) is the precision at position 𝑘.
  • rel(𝑘) is a binary indicator function that is 1 if the document at position 𝑘 is relevant, and 0 otherwise.
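As a quick worked example (with hypothetical results): if a query has three relevant documents (R = 3) and they appear at positions 1, 3, and 5 of the retrieved list, then the precision at those positions is 1/1, 2/3, and 3/5, so:

\text{AP} = \frac{1}{3} \left( \frac{1}{1} + \frac{2}{3} + \frac{3}{5} \right) \approx 0.76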

Normalized Discounted Cumulative Gain

To go even deeper, we can also use a concept called Normalized Discounted Cumulative Gain (NDCG). Where AP ties each relevant document's contribution to the precision at its position, Discounted Cumulative Gain (DCG) discounts each result's relevance score logarithmically by its position in the list. The practical effect is that relevant documents near the top of the results contribute far more to the score than those further down.2

DCG is computed by summing the relevance scores of documents in the order they are retrieved, discounted logarithmically by their position in the ranking (ensuring higher-ranked relevant documents contribute more to the score).

\text{DCG}_p = \sum_{i=1}^{p} \frac{2^{\text{rel}_i} - 1}{\log_2(i + 1)}

where:

  • 𝑝 is the rank position up to which the documents are considered.
  • rel𝑖 is the relevance score of the document at position 𝑖.
  • log2(𝑖+1) is the logarithm base 2 of (𝑖+1).

As its name suggests, NDCG normalizes the DCG score. It does this against an ideal DCG score (IDCG), allowing for better comparison of different queries. IDCG is found with the same formula, but using the ideal order of documents instead of the retrieved documents. The two values, DCG and IDCG, are then divided to give us NDCG.

\text{NDCG} = \frac{\text{DCG}}{\text{IDCG}}
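As a worked example with binary relevance (so the numerator 2^rel − 1 is simply 1 for a relevant document and 0 otherwise): suppose a query has two relevant documents and the retrieved list is [relevant, irrelevant, relevant]. Then:

\text{DCG} = \frac{1}{\log_2 2} + \frac{0}{\log_2 3} + \frac{1}{\log_2 4} = 1.5 \qquad \text{IDCG} = \frac{1}{\log_2 2} + \frac{1}{\log_2 3} \approx 1.63 \qquad \text{NDCG} = \frac{1.5}{1.63} \approx 0.92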

“Ground truth” and subjectivity

F1, AP, and NDCG are all very official-sounding, science-y terms, but one aspect I've avoided addressing so far is this: any assessment of search performance will always be subjective.

All of those “objective” metrics rely on the same thing: a preset definition of what an “ideal” search looks like. Put plainly, to measure how many relevant results have been returned for a query I need to know which pages would be relevant in the first place.

Common practice is to define a “ground truth dataset” that contains example queries and their ideal results. But I have to make this judgment myself. It's a bit of a chicken-and-egg problem: to programmatically judge whether a search is perfect, I'd have to build a perfect search algorithm.

The content I'm searching is my blog, so each "relevant document" is a page on my site, represented in the dataset by its URL.3

{
    "data visualisation": [
        "/known-pleasures-svg-line-art/",
        "/mapping-llm-embeddings-in-3d/",
        "/improving-svg-chart-interactivity-with-voronoi-diagrams/"
        // etc...
    ],
    "static site generator": [
        "/eleventy-static-site-generator/",
        "/llm-related-posts/",
        "/client-side-search-static-site/",
        // etc...
    ],
    // etc...
}

Summarising search-assessment techniques

From my n00b perspective, the important takeaway is that I now have a set of three values to compare, each of which is conveniently expressed as a number between zero and one.

  1. The F1-score tells me how well the algorithm balances precision and recall.
  2. The Average Precision tells me how well the algorithm retrieves relevant documents, rewarding those that appear early in the results.
  3. The Normalized Discounted Cumulative Gain tells me how closely the ranking of the results matches the ideal ordering.

Part 2: implementing the assessment in JavaScript

To put all these ideas into practice I'll need to turn them into JavaScript functions that I can use.

The test framework

My long-term plan is to test multiple different search algorithms, so it makes sense to build a framework that can be reused with different implementations.

I'll need to ensure each algorithm I test returns results in the same format, which in my case will be an array of document URLs. For now, I'll just use a single example algorithm: myAwesomeSearch().

This code assumes that I've imported my ground-truth dataset as groundTruth, my JSON index as index, and my example search algorithm as myAwesomeSearch.4

// Example search algorithm
const exampleSearch = {
    title: "Algo #1",
    fn: myAwesomeSearch(index),
    parser: result => result.item.url
};

const searches = [
    exampleSearch,
    // Add more search algorithms here...
];
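For context, this is roughly the shape the framework expects a search algorithm to have. The snippet below is an illustrative sketch only (a naive title-matching search), not my actual implementation, and it assumes index is an array of pages with title and url fields:

// Illustrative only: a minimal search with the expected interface shape
// (an object with a .search(query) method, returning results that each
// wrap the matched page in an `item` property).
const myAwesomeSearch = (index) => ({
    search: (query) =>
        index
            .filter(page =>
                page.title.toLowerCase().includes(query.toLowerCase())
            )
            .map(page => ({ item: { url: page.url } }))
});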

F1, AP, and NDCG all need to be run on a per-query basis, and then averaged across all queries to give a final score.

For each search algorithm this is a three-stage process: (1) get the search results for each ground-truth query, (2) run the per-query assessments on each set of results, and (3) take the average value (a.k.a. the mathematical mean) of each metric across all queries.

// Get results for each search algorithm
const finalResults = searches.map(search => {

    // Run search for each query in ground-truth
    const searchResults = Object.keys(groundTruth).map(query => {
        const results = search.fn.search(query);
        const parsedResults = results.map(result => search.parser(result));
        return {
            query,
            documents: groundTruth[query],
            results: parsedResults
        };
    });

    // Get per-query metrics
    const metrics = searchResults.map(item => {
        const F1 = getF1(item.results, item.documents);
        const AP = getAP(item.results, item.documents);
        const NDCG = getNDCG(item.results, item.documents);
        return { ...item, F1, AP, NDCG };
    });

    // Get average results
    const meanF1 = metrics.reduce((acc, item) => acc + item["F1"], 0) / metrics.length;
    const meanAP = metrics.reduce((acc, item) => acc + item["AP"], 0) / metrics.length;
    const meanNDCG = metrics.reduce((acc, item) => acc + item["NDCG"], 0) / metrics.length;

    return {
        title: search.title,
        metrics,
        meanF1,
        meanAP,
        meanNDCG,
    };
});

// Output results
console.log(finalResults);

The only parts missing now are the actual assessment functions: getF1(), getAP(), and getNDCG().

Calculating an F-score in JavaScript

The getF1 function takes two arguments:

  1. The set of results retrieved by the search algorithm for a query.
  2. The ideal list of relevant results for that query from the ground truth dataset.

const F1 = getF1(retrievedDocs, relevantDocs);

The function itself works by comparing the two sets of results to generate precision and recall values which are then fed into the F1 formula:

\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

const f1Score = 2 * (precision * recall) / (precision + recall);

The full getF1 function looks like this:

const getF1 = (retrievedDocs, relevantDocs) => {
    const relevantSet = new Set(relevantDocs);

    const relevantRetrieved = retrievedDocs.filter(doc =>
        relevantSet.has(doc)
    ).length;

    // Bail early to avoid division by zero
    if (relevantRetrieved === 0) return 0;

    const precision = relevantRetrieved / retrievedDocs.length;
    const recall = relevantRetrieved / relevantDocs.length;

    const f1Score = (2 * (precision * recall)) / (precision + recall);

    return f1Score;
};
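A quick sanity check with some made-up URLs:

// Hypothetical inputs: 2 of the 4 retrieved docs are relevant,
// and there are 3 relevant docs in total.
getF1(["/a/", "/b/", "/c/", "/d/"], ["/a/", "/c/", "/e/"]);
// precision = 0.5, recall ≈ 0.67, so F1 ≈ 0.57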

Calculating Average Precision in JavaScript

getAP takes the same two arguments as getF1: the actual results and the ideal results.

const AP = getAP(retrievedDocs, relevantDocs);

The calculation itself is more complicated.

\text{AP} = \frac{1}{R} \sum_{k=1}^{n} P(k) \cdot \text{rel}(k)

const getAP = (retrievedDocs, relevantDocs) => {
    const isRelevant = doc => relevantDocs.includes(doc);

    // Precision considering only the first k retrieved documents
    const precisionAtK = (retrievedDocs, k) => {
        const relevantRetrieved = retrievedDocs
            .slice(0, k)
            .filter(isRelevant).length;
        return relevantRetrieved / k;
    };

    // Precision at each position where a relevant document appears
    const relevantPrecisions = retrievedDocs
        .map((doc, k) =>
            isRelevant(doc) ? precisionAtK(retrievedDocs, k + 1) : 0
        )
        .filter((_, k) => isRelevant(retrievedDocs[k]));

    const sumPrecision = relevantPrecisions.reduce(
        (sum, precision) => sum + precision,
        0
    );
    const R = relevantDocs.length;

    return R > 0 ? sumPrecision / R : 0;
};
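And the same kind of sanity check for getAP:

// Hypothetical inputs: relevant docs appear at positions 1 and 3
getAP(["/a/", "/b/", "/c/"], ["/a/", "/c/"]);
// (1/1 + 2/3) / 2 ≈ 0.83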

Calculating Normalized Discounted Cumulative Gain in JavaScript

Calculating NDCG requires two functions: getDCG and getNDCG.

getDCG compares the results of a search to the “relevant documents” ground truth:

\text{DCG}_p = \sum_{i=1}^{p} \frac{2^{\text{rel}_i} - 1}{\log_2(i + 1)}

const getDCG = (retrievedDocs, relevantDocs) => {
    return retrievedDocs.reduce((acc, result, i) => {
        // Binary relevance, so 2^rel - 1 is simply 1 or 0
        const relevance = relevantDocs.includes(result) ? 1 : 0;
        // i is zero-based, so position i + 1 gives a discount of log2(i + 2)
        return acc + relevance / Math.log2(i + 2);
    }, 0);
};

The getNDCG function uses getDCG to get the DCG values from both the retrieved documents and the ideal results, and then divides them to get the normalized value.

\text{NDCG} = \frac{\text{DCG}}{\text{IDCG}}

const getNDCG = (retrievedDocs, relevantDocs) => {
    const dcgValue = getDCG(retrievedDocs, relevantDocs);
    // The ideal ordering is simply the relevant documents themselves
    const idcgValue = getDCG(relevantDocs, relevantDocs);
    // Guard against division by zero when there are no relevant documents
    return idcgValue > 0 ? dcgValue / idcgValue : 0;
};
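One more sanity check, using the same hypothetical inputs as the getAP example above:

// Relevant, irrelevant, relevant: DCG = 1.5, IDCG ≈ 1.63
getNDCG(["/a/", "/b/", "/c/"], ["/a/", "/c/"]);
// ≈ 0.92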

Part 3: running the tests

To sum up so far, I've worked out some metrics to test for and built the JS framework to allow me to test any search algorithm I want. The next step is to run the tests and see how my current search functionality performs.

I've constructed my ground truth dataset by looking at the most common search terms that lead folks to my site, plus sprinkling in a few "obvious" terms that I know should return a single result. (You can view the full JSON file in this Gist). I've kept my list of test queries quite short, but for future tests I'll aim to be more comprehensive.

The results

| mean F1 | mean AP | mean NDCG |
| ------- | ------- | --------- |
| 0.4284  | 0.5439  | 0.5597    |

I also logged each query's individual F1, AP, and NDCG values:

Per-query results (rounded to 4 decimal places to keep the table legible)

| Query                            | F1     | AP     | NDCG   |
| -------------------------------- | ------ | ------ | ------ |
| "data visualisation"             | 0.7143 | 0.8083 | 0.9270 |
| "static site generator"          | 0.6667 | 1.0000 | 1.0000 |
| "web component"                  | 0.8000 | 0.8061 | 0.9088 |
| "wordle"                         | 1.0000 | 1.0000 | 1.0000 |
| "11ty vs hugo"                   | 0.0000 | 0.0000 | 0.0000 |
| "eleventy vs jekyll"             | 0.0000 | 0.0000 | 0.0000 |
| "llm embedding"                  | 1.0000 | 1.0000 | 1.0000 |
| "llm embedding visualization"    | 0.0000 | 0.0000 | 0.0000 |
| "bullet journal"                 | 0.8571 | 1.0000 | 1.0000 |
| "react d3 line chart"            | 0.0000 | 0.0000 | 0.0000 |
| "picobel"                        | 0.8000 | 1.0000 | 1.0000 |
| "add delay to audio online"      | 0.0000 | 0.0000 | 0.0000 |
| "how to calculate reading speed" | 0.0000 | 0.0000 | 0.0000 |
| "decibel"                        | 0.1600 | 1.0000 | 1.0000 |

What does this tell me?

I've actually found these results surprising! Due to the small scale of the content being searched over, there are quite a few queries that score 1s across the board. That part is to be expected: any specific query with only one or two matching documents will either be found or not found. More interesting (and more concerning) are the queries that don't get a full match.

  • Partial matches. There are a few queries that show a decimal value denoting a partial-match with the ground-truth, and that seems to correlate with queries that have a lot of matched documents. The inference here is that either the order of the results is less than ideal, or that the search algorithm is returning too many results.
  • Total misses. More concerning are the total misses; for such a small number of test queries, a lot of them return no relevant results at all. This is a clear sign that the search algorithm is not working as intended. I'd definitely expect "llm embedding visualization" to return a result, but the "-ize" vs "-ise" spelling difference is enough to throw it off. I thought my matching algorithm was smart enough to account for this, but clearly I was wrong.

Next steps

From the point of view of "actionable insights", this is great. I have very specific things to address, and a set way to test if my improvements have improved things. I'll also work on expanding the ground-truth dataset to include more queries, and to make sure that the queries are as varied as possible.

As for the overall mean results (the mean F1, the MAP, and the mean NDCG), there's not much I can glean from them at this stage other than "I need to make the numbers go up". They'll be more useful when it comes to tracking changes to the algorithm, or comparing different algorithms entirely.

Thankfully there was some value to conducting this deep-dive, and I'm looking forward to seeing how the search functionality (and my understanding of the search functionality) improves over time.


  1. In an F1-score the "1" denotes that precision and recall are equally weighted. The more generic Fβ-score allows for variable weighting of precision and recall.
  2. NDCG can also be used in scenarios with graded relevance (i.e. documents A and B are both relevant, but B is more relevant).
  3. Thankfully my site is relatively small and I've got a pretty good mental map of all the articles. I imagine that building the "ground truth" for a bigger, more complex set of documents would be a tedious and slow process.
  4. The parser function will be different for each search algorithm. It extracts the URL from the search result object so all searches present their results in the same format. For this example we're assuming that myAwesomeSearch returns results that each have an item with a url.

