A short while ago I was asked to compare Microsoft's and Autonomy's search offerings. This was a challenging task. First of all, both companies sell search platforms, so either one can be extended (with customization) to do almost anything. That meant I couldn't rely on a simple feature-by-feature comparison matrix.
A second challenge was that Microsoft and Autonomy approach search differently. Microsoft uses Linguistic Analysis: when FAST or SharePoint index documents, the documents are represented as words in the index, and users search against those words. Autonomy uses Statistical Analysis: documents are treated as mathematical representations. Statistical Analysis compares documents after they are indexed and attempts to find matching patterns in them. Documents that contain the same significant terms are probably describing similar concepts. The idea is that similar documents should appear in a search result even when they don't contain all of the user's keywords.
Autonomy markets this technology as “Meaning Based Computing”; in the search business it's often called Concept Search. So I'm often asked, “How does it work?” In this post I'll try to explain. My background is in engineering, so I'm familiar with the math, but I'll do my best to keep your eyes from glazing over.
So let's start with our documents. Run a search on Amazon.com using the keyword “Investing” and a list of titles comes back. The titles of these books will represent the text of our documents.
1. The Neatest Little Guide to Stock Market Investing
2. Investing For Dummies, 4th Edition
3. The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns
4. The Little Book of Value Investing
5. Value Investing: From Graham to Buffett and Beyond
6. Rich Dad’s Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!
7. Investing in Real Estate, 5th Edition
8. Stock Investing For Dummies
9. Rich Dad’s Advisors: The ABC’s of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss
The first step in concept search is to remove all words that convey little to no meaning:
1. Make a complete list of all the words that appear anywhere in the collection
2. Discard articles, prepositions, and conjunctions
3. Discard common verbs (know, see, do, be)
4. Discard pronouns
5. Discard common adjectives (big, late, high)
6. Discard frilly words (therefore, thus, however, albeit, etc.)
7. Etc, etc.
Here is what we are left with:
1. Neatest Little Guide Stock Market Investing
2. Investing Dummies, 4th Edition
3. Little Book Common Sense Investing: Only Way Guarantee Fair Share Stock Market Returns
4. Little Book Value Investing
5. Value Investing: Graham Buffett Beyond
6. Rich Dad’s Guide Investing: Rich Invest, Poor Middle Class!
7. Investing Real Estate, 5th Edition
8. Stock Investing Dummies
9. Rich Dad’s Advisors: ABC’s Real Estate Investing: Secrets Finding Hidden Profits Most Investors Miss
Notice that several words (investing, stock, market, value, real, estate, and so on) appear in at least two different documents: these are potentially related concepts!
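The filtering above is easy to sketch in code. Here is a minimal version, assuming a tiny hand-rolled stop list and a crude letters-only tokenizer (real engines ship far larger stop lists and smarter tokenizers):

```python
import re
from collections import Counter

# Assumed stop list -- just enough for our nine example titles.
STOPWORDS = {"the", "to", "for", "of", "and", "in", "from", "that",
             "do", "not", "only", "your", "what", "s", "th", "most"}

titles = [
    "The Neatest Little Guide to Stock Market Investing",
    "Investing For Dummies, 4th Edition",
    "The Little Book of Common Sense Investing: The Only Way to "
    "Guarantee Your Fair Share of Stock Market Returns",
    "The Little Book of Value Investing",
    "Value Investing: From Graham to Buffett and Beyond",
    "Rich Dad's Guide to Investing: What the Rich Invest in, "
    "That the Poor and the Middle Class Do Not!",
    "Investing in Real Estate, 5th Edition",
    "Stock Investing For Dummies",
    "Rich Dad's Advisors: The ABC's of Real Estate Investing: "
    "The Secrets of Finding Hidden Profits Most Investors Miss",
]

def tokenize(title):
    # Lowercase, split on anything that isn't a letter, drop stop words.
    words = re.findall(r"[a-z]+", title.lower())
    return [w for w in words if w not in STOPWORDS]

# Document frequency: in how many different titles does each word appear?
doc_freq = Counter(w for t in titles for w in set(tokenize(t)))
shared = sorted(w for w, df in doc_freq.items() if df >= 2)
print(shared)  # the words appearing in at least two titles
```

Words that survive the stop list but appear in only one title (like “neatest”) carry no cross-document signal, so only the shared words matter for the next step.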
The next step is to create a term-document matrix.
In this matrix, each indexed word is a row and each title is a column. Each cell contains the number of times that word occurs in that title. For example, the word “book” appears one time in title T3 and one time in title T4, whereas “investing” appears one time in every title. In general, the matrices built tend to be very large, but also very sparse (most cells are blank). That is because each title or document usually contains only a small number of all the possible words.
To reiterate: in the real world, this matrix would be massive. Ten million documents would create a matrix with ten million columns!
In the following matrix, I've left out the 0s to reduce clutter.
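The matrix is easy to rebuild in code. A minimal sketch, assuming the index words are exactly the filtered words that appear in at least two titles:

```python
import numpy as np

# Assumed index words: the filtered terms appearing in 2+ titles.
terms = ["book", "dads", "dummies", "edition", "estate", "guide",
         "investing", "little", "market", "real", "rich", "stock", "value"]

# The cleaned titles from the list above, lowercased, punctuation dropped.
docs = [
    "neatest little guide stock market investing",
    "investing dummies edition",
    "little book common sense investing only way guarantee fair share "
    "stock market returns",
    "little book value investing",
    "value investing graham buffett beyond",
    "rich dads guide investing rich invest poor middle class",
    "investing real estate edition",
    "stock investing dummies",
    "rich dads advisors abcs real estate investing secrets finding "
    "hidden profits investors miss",
]

# Rows are words, columns are titles; each cell counts occurrences.
A = np.array([[doc.split().count(term) for doc in docs] for term in terms])
print(A.shape)  # (13, 9): 13 index words across 9 titles
```

Note how sparse even this toy matrix is: most cells are 0 because each title contains only a handful of the index words.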
The next step in the process is to change these values (the 1s) according to certain rules.
The first transformation is to apply term “weighting”. Term weighting is a formalization of two well-known insights in information retrieval:
1. Words that appear several times in a document are probably more meaningful than words that appear just once.
2. Infrequently used words are likely to be more interesting than common words across documents.
The first of these insights applies to individual documents, and this is referred to as local weighting. Words that appear multiple times in a document are given a greater local weight than words that appear once. A formula called logarithmic local weighting is used to calculate a value.
The second insight applies to the set of all documents in our collection and is called global term weighting. There are many global weighting schemes; all of them reflect the fact that words that appear in a small handful of documents are likely to be more significant than words that are distributed widely across our document collection. Rare words are weighted more heavily than common words. For example, a word that occurs in only 5% of the documents should probably be weighted more heavily than a word that occurs in 90% of them.
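Both weightings fit in a few lines. These are the textbook formulas (logarithmic local weighting, and inverse document frequency as one common global scheme); the exact variants that FAST or Autonomy use internally may differ:

```python
import math

def local_weight(tf):
    # Logarithmic local weighting: repeated words count more,
    # but with diminishing returns.
    return math.log(tf + 1)

def global_weight(df, n_docs):
    # Inverse document frequency: rare words get a large weight,
    # ubiquitous words get a weight near zero.
    return math.log(n_docs / df)

# In our 9-title example, "investing" appears in every title and
# "estate" in only two:
print(global_weight(9, 9))  # 0.0 -- "investing" has no discriminating power
print(global_weight(2, 9))  # ~1.5 -- "estate" is far more informative
```

This is exactly why “investing” contributes nothing to telling our nine titles apart, while “estate” immediately separates the real-estate books from the rest.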
Since the number of words in our example is small, I’ll skip this step.
There is a third step in weighting called normalization. This is a scaling step designed to keep large documents with many keywords from overwhelming smaller documents in our result set. It is similar to handicapping in golf – smaller documents are given more importance, and larger documents are penalized, so that every document has equal significance.
Finally, and by far the most complicated step, a powerful algorithm called Singular Value Decomposition (SVD) is applied to the matrix. SVD finds a reduced representation of the matrix that emphasizes the strongest relationships and throws away the noise. In other words, it makes the best possible reconstruction of the matrix with the least possible information. This is really important: in the real world these matrices become gigantic and require tremendous processing to extract anything meaningful. SVD is really what makes Concept Search work or fail.
SVD throws out noise, which does not help, and emphasizes strong patterns and trends, which do. The trick in using SVD is figuring out how many dimensions, or “concepts”, to use when approximating the matrix. Too few dimensions and important concepts are left out; too many and noise from false concepts creeps back in.
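With a numerical library, the whole SVD step is a few lines. A sketch on our term-document matrix (rebuilt here from the counts above), keeping k = 2 concepts:

```python
import numpy as np

# The 13x9 term-document matrix from above (rows: book, dads, dummies,
# edition, estate, guide, investing, little, market, real, rich, stock,
# value; columns: titles T1..T9).
A = np.array([
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # book
    [0, 0, 0, 0, 0, 1, 0, 0, 1],  # dads
    [0, 1, 0, 0, 0, 0, 0, 1, 0],  # dummies
    [0, 1, 0, 0, 0, 0, 1, 0, 0],  # edition
    [0, 0, 0, 0, 0, 0, 1, 0, 1],  # estate
    [1, 0, 0, 0, 0, 1, 0, 0, 0],  # guide
    [1, 1, 1, 1, 1, 1, 1, 1, 1],  # investing
    [1, 0, 1, 1, 0, 0, 0, 0, 0],  # little
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # market
    [0, 0, 0, 0, 0, 0, 1, 0, 1],  # real
    [0, 0, 0, 0, 0, 2, 0, 0, 1],  # rich
    [1, 0, 1, 0, 0, 0, 0, 1, 0],  # stock
    [0, 0, 0, 1, 1, 0, 0, 0, 0],  # value
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # how many "concepts" to keep -- the tuning knob discussed above
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# By the Eckart-Young theorem, A_k is the best possible rank-k
# reconstruction of A; everything in s[k:] is treated as noise.
print(np.linalg.norm(A - A_k))
```

The singular values in `s` come back sorted from strongest pattern to weakest, which is what makes “keep the top k, drop the rest” meaningful.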
The output of the SVD algorithm (in our example) is a two-dimensional (rank-2) approximation of our original matrix, which shows which words in our titles are similar in nature (concepts) and also which titles (documents) they describe.
With this information a search engine can spot relationships between our terms (concepts) and the title (documents) that they describe. This plot should help visualize that:
We see that titles T7 and T9 can be described by the concept real estate, whereas titles T1 and T3 are described by the concepts stock and market.
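The same picture can be computed directly: the rank-2 SVD gives every title a two-dimensional “concept” coordinate, and titles that sit close together share concepts even when they share few keywords. A sketch, reusing the term-document matrix from above:

```python
import numpy as np

# Same 13x9 term-document matrix as before (rows: book, dads, dummies,
# edition, estate, guide, investing, little, market, real, rich, stock,
# value; columns: titles T1..T9).
A = np.array([
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 0, 1, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 2, 0, 0, 1],
    [1, 0, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1, 0, 0, 0, 0],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Scale the first two right singular vectors: one (x, y) point per title.
doc_coords = (np.diag(s[:2]) @ Vt[:2, :]).T

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# How close are T7 and T9 (the two real-estate books) in concept space?
print(cosine(doc_coords[6], doc_coords[8]))
```

Plotting `doc_coords` (and the analogous `U[:, :2]` coordinates for the terms) reproduces the kind of scatter plot described here, with the real-estate titles clustered away from the stock-market ones.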
To conclude:
Concept Search has great promise and I’m sure there are many organizations that have deployed it and are very happy. But….
What's clear is that Concept Search is complicated. Singular Value Decomposition is complex, and you need a fair amount of higher math just to understand it.
Users of concept-based search often complain that they get unexplainable search results. This can be due to a number of factors. First of all, in order for SVD to spot patterns, you need a huge number of documents (don't be misled by my simple example). Additionally, the documents must be homogeneous in nature, meaning they must be about similar topics (that's an obvious one, I know).
Another huge problem is that Concept Search doesn't address polysemy, which is a real problem for users: one word can describe multiple concepts. If I'm searching for “Apple”, is it the fruit or the company?
Finally, IMHO, SVD is a black box. If you are getting results and users are complaining, what do you do? Microsoft's Linguistic Analysis gives admins full control of the search platform, from indexing to relevance ranking.