iSearch, uSearch
Just returned from a month in Asia and Australia (inset pix of Shanghai nightlife). Fascinating centers of innovation from Tokyo to Singapore to Sydney. All have some interesting twists to next-gen Web applications and search. But that's for another time, another post. Today it's time we revisit the death of federated search (aka metasearch, single search, etc.) as we know it, and share a glimpse of what the future holds for finally solving the very elusive problem of getting at all of our information as easily as we should be able to. Ok, just a moment...ok, yep just checked, and it's still dead, dead as a doornail.
My friends and colleagues at Stanford University library and info-sciences department have been researching this problem head-on with over 700 databases of searchable research and academic content at their disposal. They are not alone. Countless universities, companies, and web services at large have found themselves at the end of the same dead end road.
Single search doesn't work, nor does traditional metasearch, nor any other twist on federated search. Clustering metasearched results from multiple sources into artificial categories or groups only exacerbates the problem, and thoroughly confuses the end user. (Sidebar: clustering has a very long way to go before it is anywhere near ready for prime-time public consumption. Until then, it has no business being in search.) These approaches have, in fact, proven pointless and only further delay any attempts to arrive at an acceptable user experience for effectively accessing a multitude of content sources simultaneously. I would go as far as saying that these so-called solutions are robbing these important customers of their youth, costing not just hundreds of thousands in license fees, but years of setbacks and distractions dealing with totally ineffective solutions. Everybody seems to have an angle, which to me is nothing short of amusing. You'll notice through that link several spins on the same broken solution. I reviewed everything listed in those results. RIP.
There is a reason why basic search remains so widely popular, effective, and accepted by the vast majority of info seekers. Because it works. Because it is simple and intuitive. People get it. What people don't get are kludgey attempts to mash a bunch of square pegs into a round hole. If you look at the quality of search results from any of the tens or hundreds of enterprise search vendors, metasearch peddlers, and then say, Google, what you'll find might surprise you. Or maybe it won't. Yes, obviously Google.com works, and Google Search Appliance is no different. GSA stuck to its roots from Google.com for a reason: simple and intuitive user experience and high quality results-- from ONE source. Today GSA can crawl and index virtually any type of info object or database in existence. Why bother promoting new content in separate databases? This only adds to the problem. And with Google OneBox, we go even further, wiring competing content management systems to a better Google-controlled search experience.
So just what am I getting at? No, I'm not pimping Google's 'wares, but I am using them as one of only a few early examples of how to correctly begin to approach this problem. The answer is simple. One source. One index. One search interface. The fact that 700 databases sit in front of the info seeker is the real problem. There is no cohesive data model to support any meaningful metasearch whatsoever. "Normalizing" the boolean structure of the query language for each source's retrieval method was thought to 'standardize' the results that come back from all these random content sources. Not so. For it is not the query that matters; rather, it is how the content is indexed. Just because the genre or subject matter of two content databases appears to be 'related' does not imply that the returned results will be the best combination of the two sources. Why? Because they have completely independent relational structures, metadata schemas, and ontologies.
Federated search, as we knew it before it died, did nothing more than mask this problem with a bland search interface wrapped around a broken and discontinuous distributed data model. Despite the cold reality, many of you still employ this type of solution at an increasingly expensive cost to your company and to your users' productivity.
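To see why merged results from independent indexers can't be ranked honestly, here is a minimal sketch in Python. The two result lists, their document IDs, and their score scales are all made up for illustration; the point is only that each engine scores on its own scale, so any merge policy is arbitrary.

```python
from itertools import zip_longest

# Hypothetical result lists as (doc_id, score) pairs, best first.
# Each engine scores on its own scale, so the numbers are not comparable.
SOURCE_A = [("a1", 0.92), ("a2", 0.40), ("a3", 0.05)]   # bounded, 0..1
SOURCE_B = [("b1", 870.0), ("b2", 12.0), ("b3", 3.5)]   # unbounded

def merge_by_raw_score(*result_lists):
    """Pool every hit and sort by raw score, ignoring where it came from."""
    pooled = [hit for results in result_lists for hit in results]
    return [doc for doc, _ in sorted(pooled, key=lambda hit: -hit[1])]

def merge_round_robin(*result_lists):
    """Interleave the lists by rank position, ignoring the scores entirely."""
    merged = []
    for tier in zip_longest(*result_lists):
        merged.extend(hit[0] for hit in tier if hit is not None)
    return merged

print(merge_by_raw_score(SOURCE_A, SOURCE_B))
print(merge_round_robin(SOURCE_A, SOURCE_B))
```

Sorting on raw scores buries everything from the first source, simply because the second source uses bigger numbers. Round robin treats the third-best hit from one source as equal to the third-best from another, which nobody can justify. Neither ordering says anything about which document best answers the query.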
But let's get back to the answer. Google introduced Universal Search, after quietly testing the concept under an alias website: searchmash.com. Yep, they really do. Universal Search is not there yet, but it is a move in the right direction. Yes, even Google faced a minor federation/metasearch problem as they continued to grow laterally into new content categories, e.g. News, Photos, Videos, Blogs, Products, Scholar, etc... As a result, it became increasingly unclear whether Google.com was the right place to start a search with so many alternate entry points that may be more appropriate for certain searches, e.g.: blogsearch.google.com, or news.google.com, and many more.
Universal Search is an early attempt to give the user a little taste of everything: pictures, videos, blogs, news, and web search results in one result page. Check out this basic example here for Steve Jobs. You get what I'm saying. Now, this doesn't exactly scale if you have 20, 30, or 700 types of content, or content sources to display on a page. They simply wouldn't fit. Additionally, Universal Search is more about displaying content of different types or formats versus merely different sources of content. For example, web pages, news articles, pictures, and videos are all very different types of content. I have designed two unique ways to address this problem, following some of the principles of Universal Search. Enter Integrated Search.
The integration of content sources is where we begin. The devil is most certainly in the details for this design and implementation, but here is the gist:
Recipe for Integrated Search
Ingredients
n parts of unique content sources
1 part really nice crawler/indexer (Nutch, GSA, or Lucene)
1 part high quality query interface with boolean translators, NLP, and auto completion and suggestion. (See CiteSeer or ACM for several)
Frappé all ingredients until smooth. Let stand and cool for 10 minutes.
Season to taste with one or both of the following:
1 search index inverter (yes, the secret sauce)
A dash of user intent interpolation at the point of query
This solves 3 problems at once. First, a single index, so that no sources need be considered at query time, ever. Second, smart pre-query processing to help guide the search query to match the user's intent. (We'll discuss intent-driven searches, or lack thereof, in an upcoming post.) And third, a powerful index/ranker to ensure that every content object in the index, from every original source, is uniformly considered when ordering and displaying the results that best match the query.
This is NOT the case with traditional federators, which do nothing more than combine search results from hundreds of different indexing methodologies, with absolutely no way to 'honestly' or intelligently rank and order results that come from different indexers and ranking algos.
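By contrast, the single-index idea can be sketched in a few lines of Python. This is a toy illustration, not the actual design: the corpus, the source names, and the simple TF-IDF scoring are all assumptions standing in for whatever a real crawler/indexer would do. What matters is that every document, whatever its origin, lands in one index and is scored by one function.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus of (source, doc_id, text); source names are made up.
DOCS = [
    ("library_catalog", "c1", "history of federated search in libraries"),
    ("journal_db", "j1", "federated search ranking across databases"),
    ("journal_db", "j2", "protein folding simulation methods"),
    ("news_archive", "n1", "search engines add video and news results"),
]

def tokenize(text):
    return text.lower().split()

def build_index(docs):
    """Build ONE inverted index: term -> {(source, doc_id): term frequency}."""
    postings = defaultdict(dict)
    for source, doc_id, text in docs:
        for term, tf in Counter(tokenize(text)).items():
            postings[term][(source, doc_id)] = tf
    return postings

def search(postings, n_docs, query):
    """Score every matching document with the same TF-IDF formula.

    The source is carried along only as a label; it plays no part in ranking.
    """
    scores = defaultdict(float)
    for term in tokenize(query):
        matches = postings.get(term, {})
        if not matches:
            continue
        idf = math.log(1 + n_docs / len(matches))
        for key, tf in matches.items():
            scores[key] += tf * idf
    # Highest score first; ties broken by key so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

index = build_index(DOCS)
for (source, doc_id), score in search(index, len(DOCS), "federated search"):
    print(f"{score:.3f}  {doc_id}  ({source})")
```

Because one scorer sees every document, hits from all three made-up sources come back in a single, comparable ordering, and nothing has to be consulted per source at query time.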
So even without revealing the secret sauce, you can see how this approach is fast, simple, and aligned with traditional search user experiences. The hard part? Crawling all the content sources means writing system adapters to connect to the weirdest of old-school flat-file DBs, obscure object databases, and a whole lot worse. But if you pick a good crawler or general search product, much of that hacking has been done for you, as with Google's Search Appliance and their 220+ adapters that work pretty well out of the Box, pun intended.
So about that secret sauce? Well, with a good inference about the user's intent we can bias the search results to better cater to the user's objective. And as for index inverting, it's really about inverting the results that come from the index for a given query. Ever curious what results actually appear at the end of a big web search with 5,400,000 results? How about dead middle of those 5.4 mil? Curious, aren't we? Yes, it's all about discovery, and those deeper results can be more useful than you might think.
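The secret sauce stays secret, but the discovery idea itself (peeking at the middle and the end of a ranked list rather than only the top) is easy to sketch. The Python below is a guess at the simplest possible version; the function name `sample_ranked`, the window size, and the made-up document IDs are all assumptions for illustration.

```python
def sample_ranked(ranked, window=3):
    """Return small windows from the head, middle, and tail of a ranked list.

    The tail window is reversed so the very last result comes first.
    """
    n = len(ranked)
    mid_start = max(0, (n - window) // 2)
    return {
        "head": ranked[:window],
        "middle": ranked[mid_start:mid_start + window],
        "tail": ranked[max(0, n - window):][::-1],
    }

# Ten made-up document IDs, already ordered best first.
ranked = [f"doc{i}" for i in range(1, 11)]
for name, docs in sample_ranked(ranked).items():
    print(name, docs)
```

In a real engine the full list would never be materialized for 5.4 million hits; the same windows would be fetched by offset from the index.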
As screen real estate continues to increase on the desktop/laptop, we'll no doubt continue to see search results get 'fatter', as in wider across the page. Yes, two and three column search results are on the way. And wait till you see where the ads turn up. For search, it's just the beginning. For federated search, well, maybe we'll call it a new beginning. But for them, this means starting over. Completely.
So far I've yet to see any legitimate newcomers enter the arena to take up this challenge/opportunity head-on. In the meantime, partial solutions are manifesting within Web search while Google, Yahoo, and Ask continue to advance some good ideas in this arena. Yes, even Ask has been doing 'Unified' Search on their home page for a while now, and it's actually a reasonably clean UI...try out this query: iPhone. Be sure to stretch your browser as wide as it will go...not bad.
Integrated Search, iSearch. Coming to a theater near you? We'll soon find out...

