TF-IDF — A simple explanation.
In the field of Information Retrieval, TF-IDF is a common statistic used to measure how important a word is to a document in a collection (corpus) of documents.
This article introduces the components of TF-IDF and explains how it can be used for search query ranking. The emphasis is on simple explanations and easy-to-follow examples.
Let’s start with the setup that we’re going to use through the rest of this article. Say we have a corpus of 10,000 randomly selected Wikipedia articles related to nature. A small number of the articles talk about foxes, and a very small number of those talk about red foxes.
Some Commonly Used Terminology
Corpus — a collection of documents (think all Wikipedia articles).
Term — essentially a word (like “fox”).
TF — Term Frequency
Measures how common a term is in a particular document.
Formula:
TF(term, document) = number of times the term appears in the document / total number of terms in the document.
Example:
Document: The brown fox is common in the woods of the Pacific Northwest.
TF(the, document) = 3 / 12 = 0.25
TF(fox, document) = 1 / 12 ≈ 0.08
Of course, real Wikipedia articles would be much longer, but this works as an example.
Meaning:
High TF — common in document.
Low TF — not common in document.
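To make this concrete, here is a minimal Python sketch of TF. The whitespace tokenizer and the lowercasing are simplifying assumptions; real systems tokenize more carefully.

def tf(term, document_tokens):
    # Fraction of the document's tokens that are equal to the term.
    return document_tokens.count(term) / len(document_tokens)

tokens = "the brown fox is common in the woods of the pacific northwest".split()
print(tf("the", tokens))  # 3/12 = 0.25
print(tf("fox", tokens))  # 1/12 ≈ 0.08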
IDF — Inverse Document Frequency
Measures how rare a term is across the corpus: the fewer documents contain it, the higher the IDF.
Formula:
IDF(term, corpus) = log(number of documents / number of documents containing the term)
Here log is the natural logarithm; the base only scales all scores uniformly, so it doesn’t affect ranking.
Example:
In our corpus of 10,000 Wikipedia articles, we might have:
IDF(“the”, corpus) = log(10,000 / 9,850) ≈ 0.02
IDF(“fox”, corpus) = log(10,000 / 3) ≈ 8.11
Meaning:
High IDF — rare word.
Low IDF — common word.
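A matching sketch for IDF, assuming the corpus is represented as a list of token lists, one per document. math.log is the natural logarithm, which matches the numbers above; a real implementation would also guard against terms that appear in no documents (a common trick is adding 1 to the denominator).

import math

def idf(term, corpus):
    # corpus: a list of documents, each a list of tokens.
    docs_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / docs_containing)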
Bringing it together — TF-IDF
TF-IDF is computed by multiplying the two:
TF-IDF(term, document, corpus) = TF(term, document) * IDF(term, corpus)
Example:
Term: fox
TF — higher in documents mentioning foxes often.
IDF — high, it’s a rare word.
Term: red
TF — higher in documents mentioning “red” often.
IDF — lower, it’s a more common word.
Meaning:
TF-IDF is high for words that are commonly found in a document, but rare in the corpus. That makes sense. If the word is “fox” and we have one article that talks about foxes a lot in a corpus where we don’t have many fox references, that article is obviously very relevant for the word fox.
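Reusing the two sketches above, combining them is a one-liner:

def tf_idf(term, document_tokens, corpus):
    # High when the term is frequent in this document but rare in the corpus.
    return tf(term, document_tokens) * idf(term, corpus)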
Using TF-IDF in search ranking
Let’s walk through how we can use TF-IDF for search ranking.
One simple way of doing it is this: given a query of K terms, we go over our N documents and compute the TF-IDF of each of the K terms with respect to every document. For every document, we then sum the computed TF-IDF values and treat that sum as the document’s score for the query (see the sketch below).
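Here is a direct, unoptimized sketch of that scheme, building on the functions above. A real search engine would precompute an inverted index instead of rescanning every document per query, but the scoring logic is the same.

def score(query_terms, document_tokens, corpus):
    # The document's score is the sum of per-term TF-IDF values.
    return sum(tf_idf(term, document_tokens, corpus) for term in query_terms)

def rank(query_terms, corpus):
    # Score every document and return its indices sorted best-first.
    scored = [(score(query_terms, doc, corpus), i) for i, doc in enumerate(corpus)]
    return [i for s, i in sorted(scored, reverse=True)]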
Example 1: single-word query.
If the query is “fox”, the IDF is the same for every document. The TF is higher for documents that mention the word fox most often relative to their length, so those end up ranked highest. That makes sense.
Example 2: two-word query.
Let’s say the search query is “red fox”. Let’s work through a few documents to get an idea of what their scores are going to look like.
Document d1 — talks about foxes, but not about red foxes.
TF-IDF(red, d1)
= TF(red, d1) * IDF(red, corpus)
= (2/250) * log(10000/60) = 0.008 * 5.11 ≈ 0.041
TF-IDF(fox, d1)
= TF(fox, d1) * IDF(fox, corpus)
= (30/250) * log(10000/3) = 0.12 * 8.11 ≈ 0.973
Score: 0.041 + 0.973 ≈ 1.01
Document d2 — talks about red foxes.
TF-IDF(red, d2)
= TF(red, d2) * IDF(red, corpus)
= (16/250) * log(10000/60) = 0.064 * 5.11 ≈ 0.327
TF-IDF(fox, d2)
= TF(fox, d2) * IDF(fox, corpus)
= (35/250) * log(10000/3) = 0.14 * 8.11 ≈ 1.135
Score: 0.327 + 1.135 ≈ 1.46
Document d3 — talks about something else.
TF-IDF(red, d3)
= TF(red, d3) * IDF(red, corpus)
= (1/250) * log(10000/60) = 0.004 * 5.11 ≈ 0.020
TF-IDF(fox, d3)
= TF(fox, d3) * IDF(fox, corpus)
= (0/250) * log(10000/3) = 0 * 8.11 = 0
Score: 0.020 + 0 = 0.02
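As a sanity check, a few lines of Python reproduce these scores. The term counts and the 250-term document lengths are taken straight from the examples above.

import math

IDF = {"red": math.log(10000 / 60), "fox": math.log(10000 / 3)}  # ~5.11 and ~8.11

# term counts per document; every document is 250 terms long
counts = {
    "d1": {"red": 2, "fox": 30},
    "d2": {"red": 16, "fox": 35},
    "d3": {"red": 1, "fox": 0},
}

for name, c in counts.items():
    doc_score = sum((c[term] / 250) * IDF[term] for term in ("red", "fox"))
    print(name, round(doc_score, 2))  # d1 1.01, d2 1.46, d3 0.02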
Let’s summarize. Note how fox, which is a much rarer word than red, contributes more to the score for the query “red fox”. This is a very useful property of using TF-IDF for ranking: for the query “red fox”, we’d rather get articles about foxes than articles about red things.
In the field of information retrieval, there are various ranking functions inspired by TF-IDF, such as Okapi BM25.