Searching: Vector Space methods
Following on from my previous post describing the basic theory of text processing for a search engine, I want to go on to describe some of the theory behind the evaluation of search results. This post is only going to cover vector space methods.
At the end of the previous post we had generated a list of tokens and a list of their occurrences in each document.
In the simplest case, I could search for a single word that only occurs in a single document in which case a simple inverted index mapping tokens to documents will suffice . Even in the case where we search for a single word that occurs in multiple documents, we can straightforwardly choose to rank the documents in order of the number of occurrences of the search word, those with more occurrences being more highly ranked. The real trouble starts with multi-word queries, how do we rate the relative importance of each search word and how do we combine this information to produce a single ranked list?
In a vector-based method we only consider the number of occurrences of each token in each document - for this reason phrase searching is difficult with this approach, requiring either additional indexes or post-match filtering. The vector space we consider is a high dimensional one, having one dimension for each token that occurs in our document collection. Each document is then represented by a vector in this space that has a length along each dimension equal to the number of times that token appears in the document. For example consider the following vectors showing the number of occurrences of the four words "following", "previous", "lot" and "spent" in two documents:
FOLLO | PREVI | LOT | SPENT | |
Document 1 | 2 | 0 | 1 | 1 |
Document 2 | 1 | 1 | 0 | 0 |
The intuitive picture behind this model is then that similar documents will produce term vectors that point in a similar direction. Equally, we can turn our query into a term vector and then the document term vectors closest to it will be the best matches. There are several different measures we could use to calculate this notion of closeness of term vectors, one common choice is calculate the (cosine of the) angle between the two vectors: v1.v2 /||v1|| ||v2||, where v1.v2 is a dot product and || v|| is the norm or length of the vector.
For our simple example, with the query "following" (which has the term vector [1,0,0,0]) we get:
Document 1: [2,0,1,1] . [1,0,0,0] / || [2,0,1,1] || || [1,0,0,0 || = 2 / sqrt(6) * 1 = ~ 0.81 Document 2: [1,1,0,0] . [1,0,0,0] / || [1,1,0,0] || || [1,0,0,0] || = 1 /vsqrt(2) * 1 = ~ 0.71
Showing that document 1 is considered a better hit for this particular query.
In practice, we can perform this query very efficiently by storing all the (normalised) document term vectors in a matrix and multiplying this matrix by the (normalised) query vector. Such a matrix will generally be sparse (most elements will be zero) and there is much literature and a number of libraries dedicated to efficient storage of, and calculation with, such matrices.
We can not only apply this method to search queries, but also to entire documents. Given our intuitive picture of the vector space having similar documents pointing in similar directions, we can use this to provide a 'find similar documents' service by using the vector of the entire document as a search vector.
A popular example of a vector-based textual search engine is lucene - an excellent apache project that provides a search engine toolkit from which search applications can be created.