WEBINAR: Tactics for Extending Unified Search to the Four Corners of Your Global Enterprise
Have you ever wondered how Google returns a search result in less than a second? It’s all made possible through the magic of indexing. On a periodic basis, Google goes out across the Internet and crawls all of the content it can access. During this process, the crawler pulls each document or web page back to the indexer, where the document is broken down into the list of words it contains. Google creates a database which, in the world of search, is often called the index. When a user executes a query, it is the index that is queried for relevant data, resulting in sub-second response times. Think of an index as a data warehouse for unstructured information.
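To make the idea concrete, here is a minimal sketch of an inverted index in Python. The documents and words are invented for illustration; a real engine also handles tokenization, stemming, ranking, and vastly larger scale.

```python
# A minimal sketch of an inverted index: map each word to the set of
# documents containing it, so a query becomes a dictionary lookup
# instead of a scan of every document. (Illustrative only.)
from collections import defaultdict

documents = {
    "doc1": "the crawler pulls each document back to the indexer",
    "doc2": "the index is queried for relevant data",
}

index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.lower().split():
        index[word].add(doc_id)

# Querying the index is a lookup, not a crawl.
print(index["index"])    # {'doc2'}
print(index["crawler"])  # {'doc1'}
```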
Now, all of this sounds great, but there are times when indexing information is not appropriate or even possible.
One of the issues with indexing is that the index is only as up to date as your most recent crawl. If you have data that changes rapidly, keeping the index up to date can be a challenge. Booking a reservation on an airline is a good example: when a customer books a flight, querying an index doesn’t make sense. The only way to ensure you’re not giving away someone’s seat is to query the reservation system directly. (Then again, they’ll overbook the flight anyway!)
Beyond the “freshness” of the data, there are other reasons why indexing may be impractical. One showstopper is bandwidth: indexing eats it up. Every document or piece of data is pulled from the source location back to the indexer, so if you are indexing a terabyte of data, you are moving a terabyte across the network.
Some organizations in countries in the E.U. and in Canada won’t permit parent companies in the U.S. to index their content. Their concern is privacy: the Patriot Act gives the U.S. government the right to seize data from any organization if there is a perceived security concern, and an index essentially copies everything it crawls, making it a prime target for a discovery action.
One large organization that BA Insight is currently working with has subsidiaries in various countries across the globe. The subsidiaries have signed agreements with customers that prohibit the data from being transported across country boundaries.
One final example that I can think of is a situation where a subsidiary has a highly specialized search deployment and simply doesn’t want corporate to centralize search for fear of losing its customizations.
I’m sure there are other scenarios where indexing isn’t practical that I haven’t covered here. All of that being said, is there a solution?
Enter Federated Search
Federated Search technology has been around for about a decade now. When a user executes a query, it is intercepted by the Federator. The Federator then passes the query to a number of different search engines, each of which executes the query. The search results from each engine are passed back to the Federator, which merges them into a single unified result list.
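Here is a rough sketch of that fan-out-and-merge pattern in Python. The engine functions are stand-ins invented for this example; a real Federator would call actual search APIs over the network.

```python
# Federation pattern sketch: fan a query out to several engines in
# parallel, gather each result list, and merge them into one.
from concurrent.futures import ThreadPoolExecutor

def engine_a(query):
    # Stand-in for a real search engine; returns (doc_id, score) pairs.
    return [("a-doc1", 0.9), ("a-doc2", 0.4)]

def engine_b(query):
    # Note the different score scale -- a real merging headache.
    return [("b-doc1", 80.0), ("b-doc2", 20.0)]

def federate(query, engines):
    # Fan the query out to every engine in parallel and gather results.
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda engine: engine(query), engines))
    # Naive merge: simple concatenation. Real federators must first
    # normalize scores across engines -- one of the challenges below.
    return [hit for hits in result_lists for hit in hits]

print(federate("quarterly report", [engine_a, engine_b]))
```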
The value of all of this is that all indexing is done locally, by each respective search engine. This solves all of the issues mentioned above, including the bandwidth and privacy concerns. Since all indexing is done locally, bandwidth is a non-issue. Privacy is not an issue either, because a user can only query for data that he or she has access to in the first place. Federation can also solve the challenge of “data freshness”: if the relevant data is stored in a transactional database, you are querying the source system directly, so you’ll always get the most up-to-date information. Problem solved. Another argument for Federation is that it enables access to vast amounts of data that one would never index for sheer lack of resources. Why index the Internet when you can just Federate Google or Bing?
Now, Federation has its own set of challenges, which include:
- Performance can be an issue. For example, because the query is passed to multiple search engines, if one engine is offline the user has to wait until that request times out.
- Security can be a challenge: if the search engines use different security models, single sign-on must be implemented.
- Data duplication can clutter the merged search results.
- Merging the search results is a challenge in and of itself, in the sense that the Federator has to merge results from different engines that probably calculated relevance differently. How does one normalize that? (See the sketch after this list for one simple approach.)
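One common, if simplistic, way to attack that last problem is min-max scaling: rescale each engine’s scores to a shared [0, 1] range before interleaving. This sketch is illustrative only and is not BA Insight’s approach; production federators use more sophisticated schemes such as round-robin interleaving or re-ranking.

```python
# Min-max normalization: put each engine's scores on a common
# [0, 1] scale so result lists can be interleaved by score.

def normalize(hits):
    """Rescale (doc_id, score) pairs so scores fall in [0, 1]."""
    scores = [score for _, score in hits]
    lo, hi = min(scores), max(scores)
    if hi == lo:  # all scores equal; avoid division by zero
        return [(doc_id, 1.0) for doc_id, _ in hits]
    return [(doc_id, (score - lo) / (hi - lo)) for doc_id, score in hits]

def merge(result_lists):
    merged = [hit for hits in result_lists for hit in normalize(hits)]
    return sorted(merged, key=lambda hit: hit[1], reverse=True)

# Engine A scores on a 0-1 scale, engine B on a 0-100 scale; after
# normalization the two lists can be ranked on a common scale.
print(merge([[("a1", 0.9), ("a2", 0.4)], [("b1", 80.0), ("b2", 20.0)]]))
```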
I am going to be conducting a webinar on the 23rd where I’ll cover Federation in general and present BA Insight’s approach to Federation. Hope to see you there!