Searching: Text Preprocessing

I've spent a lot of time working on various searching and information retrieval systems and so thought I'd start off my new blog with a short series of posts on the basics of modern search systems.

This first post covers the initial processing of a textual document: taking a raw block of text and converting it into a sequence of tokens that will form the basis of the search engine's index. We will assume that extraneous markup has been removed and that we are simply dealing with a block of English text.

There are two pressures driving the tokenisation of any text: one is the need to minimise the index size, and the other is the desire to enhance the quality of the matches found.

One way to minimise the index size is to employ a simple technique such as N-grams. In this method we create tokens from the text by taking each subsequence of N letters, where N is usually around 2 to 4. For example, the start of this sentence would thus tokenise to the following 3-grams:

'For', 'or ', 'r e', ' ex', 'exa', 'xam', 'amp', 'mpl', 'ple' ...

This approach has some advantages, mainly that the number of possible tokens is strictly limited by the number of possible 3-grams: if we ignore case and all punctuation except spaces (26 letters plus the space character gives 27 possible symbols), we have a maximum token set of size 27 * 27 * 27 = 19,683, which is substantially smaller than the number of words in a dictionary. The main benefit, and the main disadvantage, of this approach is that the tokenisation makes no use of the actual properties of the language. For languages such as Japanese, where more complex morphological techniques are difficult to apply, this is a benefit; in English-language searching it is often a disadvantage.
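To make this concrete, here is a minimal sketch in Python (the ngrams helper is purely illustrative, not taken from any particular library):

    def ngrams(text, n=3):
        """Slide a window of n characters along the text, one character at a time."""
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    print(ngrams("For example"))
    # ['For', 'or ', 'r e', ' ex', 'exa', 'xam', 'amp', 'mpl', 'ple']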

Probably the simplest example of processing text based on the underlying language is the notion of stop words. The vast majority of common English words rarely appear usefully in the kind of keywordese people have learnt to use in search engines, so we can make a strategic decision to throw away the most common 'low meaning' words, such as 'a', 'an', 'the' and so on. This has the benefit of vastly reducing the size of our index at the expense of losing some precision in the matches: we may no longer be able to distinguish phrases such as "Man in the Moon" from "Man on the Moon", or, worse, we lose all chance of finding "The Who".
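A minimal sketch of the idea, with a deliberately tiny stop list (a real engine would use a considerably longer one):

    STOP_WORDS = {"a", "an", "and", "the", "in", "on", "of", "is", "it", "that", "to", "be"}

    def remove_stop_words(tokens):
        """Drop any token that appears in the stop list, ignoring case."""
        return [t for t in tokens if t.lower() not in STOP_WORDS]

    print(remove_stop_words("Man in the Moon".split()))
    # ['Man', 'Moon']  (now indistinguishable from "Man on the Moon")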

The final technique I'll mention here, stemming, is worthy of a post or two in its own right, so I'll be necessarily brief. Stemming is the process of reducing words to their root form by removing or undoing the effects of verb conjugation or pluralisation. This is perhaps best shown by example: 'fishing', 'fishes' and 'fishery' could all be viewed as essentially variants of the root word 'fish', and so we can choose to ignore these variations and simply index the stem 'fish' when we see any of these words. In English, stemming is usually accomplished by removing the ends of words until a certain point is reached; other languages, including Cornish, make the job more complex by altering the first characters of some words as well.

The most common algorithm used for English is the Porter stemmer, or modifications thereof. It operates essentially by repeatedly removing common English suffixes until suitably few characters remain as the token to be indexed. Martin Porter maintains a collection of stemming algorithms and implementations called Snowball, which includes versions of his Porter stemmer and is used, for example, by the open source search engine Lucene.
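If you want to experiment, NLTK ships one freely available implementation of the Porter stemmer (this sketch assumes you have NLTK installed):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["fishing", "fishes", "universally", "possession"]:
        print(word, "->", stemmer.stem(word))
    # fishing -> fish
    # fishes -> fish
    # universally -> univers
    # possession -> possess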

Stemming is notable not only for reducing the size of the index, by reducing the total number of distinct tokens that need to be stored, but also for potentially improving the quality of results, by returning documents containing conjugated versions of the search term.
It is important to note that, for consistency, the search query must be stemmed in the same way as the original text was processed.

Let's wrap this all up with an example:

It is a truth universally acknowledged, that a single man in possession of a good fortune must be in want of a wife.

may well end up, following removal of stop words and after passing through a suitable stemmer, with the following tokens to index:

TRUTH, UNIVERS, ACKNOWLEDG, SINGL, MAN, POSSESS, GOOD, FORTUN, MUST, WANT, WIFE
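For completeness, here is a rough sketch of the whole pipeline, reusing the toy stop list from above together with NLTK's Porter stemmer; run over the sentence, it reproduces the token list just given:

    import re
    from nltk.stem import PorterStemmer

    STOP_WORDS = {"a", "an", "and", "the", "in", "on", "of", "is", "it", "that", "to", "be"}
    stemmer = PorterStemmer()

    def tokenise(text):
        """Lower-case, split into runs of letters, drop stop words, then stem."""
        words = re.findall(r"[a-z]+", text.lower())
        return [stemmer.stem(w).upper() for w in words if w not in STOP_WORDS]

    sentence = ("It is a truth universally acknowledged, that a single man in "
                "possession of a good fortune must be in want of a wife.")
    print(tokenise(sentence))
    # ['TRUTH', 'UNIVERS', 'ACKNOWLEDG', 'SINGL', 'MAN', 'POSSESS',
    #  'GOOD', 'FORTUN', 'MUST', 'WANT', 'WIFE']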

With this index we should be able to retrieve this document in response to queries such as:

universal truths
universal acknowledgement
"single man"
"good fortunes"

and it is to the topic of index building and query retrieval that I will turn next.