Federated Search is dead, dead, dead...
I've been asked to write about this for some time, and that time has finally come. O'Reilly touched on the subject recently discussing Google's plans in this arena. But white papers on the subject are not required to explain why traditional approaches to this dilemma are toast. In this post I'll explain why federation is broken and how corporations, universities, and start-ups continue to throw $ at the wrong end of the problem...click below to dive in!
Federation defined
Federated search is the art of attempting to execute a single keyword search across n number of databases, content sources, indexes, news feeds, etc. This is also known as metasearch, deep-web search, and content aggregation. No central federated index is maintained, and no crawling or spidering is required. The idea, in theory anyway, is certainly a convenient one. At Groxis, 90% of our customers were most interested in federating their enterprise content sources. In large companies and universities alike, the sea of available content silos for any given organization is vast. It is not uncommon to find hundreds, even thousands, of content sources used across a single organization.
Federation is all about passing the user's search query separately to each of those search engines, collecting x number of search results from each source, and then figuring out how to display them to the user in some meaningful, useful, actionable way. This display challenge arises because many of the content sources typically cannot be crawled or spidered, whether due to licensing issues, ownership of the content, or because the content resides in a data store that is not crawlable. Examples: SQL databases, proprietary content management systems, and commercial content such as Lexis, Factiva, Reuters, etc. Federated search is a quick and dirty way to scan across a vast array of content sources.
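To make the mechanics concrete, here is a minimal sketch of that fan-out in Python. Everything named here is an assumption for illustration: the `federate` function, the `make_source` helper, and the two toy sources stand in for real search back ends, which would normally be reached over a network API rather than an in-memory list.

```python
from concurrent.futures import ThreadPoolExecutor


def federate(query, sources, per_source=10):
    """Send the same query to every source and keep the top `per_source` hits from each."""
    def ask(item):
        name, search = item
        return name, search(query)[:per_source]

    # One worker per source so every back end is queried in parallel.
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        return dict(pool.map(ask, sources.items()))


def make_source(docs):
    """Hypothetical stand-in for a remote search engine: substring match over titles."""
    def search(query):
        return [d for d in docs if query.lower() in d.lower()]
    return search


SOURCES = {
    "news": make_source([
        "Search federation in the enterprise",
        "Enterprise search market grows",
        "Weather report",
    ]),
    "library": make_source([
        "Federated search",
        "Metasearch engine",
        "Search engine indexing",
    ]),
}

results = federate("search", SOURCES, per_source=2)
for name, hits in results.items():
    print(name, hits)
```

Note what comes back: one separate result list per source, each ranked by that source's own logic. Nothing in the fan-out itself says how those lists should be combined for the user.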
Federated challenges
However, there is a fundamental usability and search logic problem with today's generic search federation. Let's presume you are federating just 20 content sources into a single search query interface. Using a traditional search results display format, 10 results per page, we arrive at problem #1: results ordering and display.
Arbitrary solutions (aka common hacks)
A. Should we apply a weighting algorithm to each of the sources to favor more 'important' sources? Sure, we could. But this is arbitrary, not contextual, and thus totally inefficient.
B. Should we apply speed? First results to come back get displayed first? Hardly contextual, hardly consistent, and hardly sufficient. A poor man's federation, to be sure. More on performance issues later.
C. How about ordering all the results into topic clusters? Sounds great: this allows us to organize all the results from all of our 20 sources into a cluster map, organized by topics, not content sources. On the surface this could indeed address some of federation's shortcomings. However, the problem is that topic clustering technology is woefully inadequate for serious research or just serious federation. I've reviewed, licensed, and tested every serious clustering engine in development, and even hacked together my own clustering algorithms over the past several years. They all have a common problem: they require optimization and customization for each and every content source, and never work consistently enough to win mass user adoption. They require unique stop word lists, phrase delineations, dictionaries, cluster tuning, label tuning, and a host of other tweaks. I could go deep here, but let's not get off topic. In fact, wait for my next posting that illustrates why document clustering is also dead, dead, dead.
I mentioned speed above. The other big usability problem is the speed at which each source returns results. Oftentimes we cannot produce a combined result set because the federation engine is waiting on sources to return their results. Some sources can be woefully slow, causing total response times of up to 3-5 minutes! Yes, I've seen this in production at large enterprise sites. This is how to cream mass user adoption in about exactly... 3-5 minutes.
D. Another common 'solution' is to let the user pre-select the content sources from which to federate the keyword search. Sounds reasonable on the surface, until you have 20 or 700 data sources to choose from. Even grouping them together leaves too much to the imagination from a usability standpoint. Users aren't trained to 'think' about these intricacies; they just search and go. Advanced Search panes are rarely utilized correctly, if at all. Further, most users will know much less about each content source than the federation platform does. As such, having source selection choices is a massive burden on the user if there are more than 7-12 sources to choose from. In the end, this does not solve the problem; in most cases it adds to it.
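Two of the hacks above are easy to sketch, and seeing them in code shows how little they know about the query. The Python below is a rough illustration under stated assumptions, not any vendor's implementation: `federate_with_deadline`, `merge_by_source_weight`, the three toy sources, the 0.2 second deadline, and the weights are all made up for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def federate_with_deadline(query, sources, deadline_s):
    """Query every source in parallel, but give up on any that miss the deadline."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(sources)))
    futures = {pool.submit(search, query): name for name, search in sources.items()}
    done, not_done = wait(futures, timeout=deadline_s)
    pool.shutdown(wait=False)  # do not block the caller on stragglers
    answered = {futures[f]: f.result() for f in done if f.exception() is None}
    skipped = sorted(futures[f] for f in not_done)
    return answered, skipped


def merge_by_source_weight(answered, weights):
    """Hack A: order hits by a static per-source weight; the query plays no part."""
    ranked = []
    for name, hits in answered.items():
        for rank, hit in enumerate(hits):
            ranked.append((-weights.get(name, 0.0), rank, name, hit))
    ranked.sort()
    return [(name, hit) for _, _, name, hit in ranked]


def slow_archive(query):
    time.sleep(1.0)  # hypothetical slow back end
    return ["Archive hit 1"]


SOURCES = {
    "intranet": lambda q: ["Intranet hit 1", "Intranet hit 2"],
    "newswire": lambda q: ["Newswire hit 1"],
    "archive": slow_archive,
}
WEIGHTS = {"newswire": 2.0, "intranet": 1.0, "archive": 3.0}  # arbitrary by design

answered, skipped = federate_with_deadline("anything", SOURCES, deadline_s=0.2)
merged = merge_by_source_weight(answered, WEIGHTS)
print(merged)
print("skipped:", skipped)
```

The deadline keeps the page responsive, but it silently drops the slow archive even though it carries the highest weight. The weighted merge puts every newswire hit ahead of every intranet hit no matter what the user typed. Neither decision is contextual, which is the complaint.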
The real solution - does one really exist?
Wouldn't it be nice if there were a simple, elegant solution to this most vexing problem? Librarians, universities, researchers, and knowledge enterprises would rejoice with a resounding thunder! And the company with the solution would similarly rejoice in the prying open of even the tightest purse strings of customers vying to get their hands on the proven solution once and for all.
Well, there is good news and bad news. The good news is that there is an obvious solution. The bad news is that it is really, really, really hard to do. The solution: index everything. (Note: this is not the same as metasearch, which only aggregates results from separate search engines, as metasearch has no indexing capability...for now ;) One index, one result set, for all content online. Yes, I said it. If literally every content source were opened up to be crawled and indexed without prejudice, a single, uniform index could go to work providing users the most useful results from a single search. Sort of like removing DRM from digital music, in a way, I suppose. Let the content be free! The difference being, premium content publishers would not have to open up the body of the content to the end user. Just look at how Yahoo and others handle searching premium content. You can access the metadata (title, author, abstract, summary, etc.) and then you pay to gain access to the full text. [ Paying for content is yet another topic altogether. Yes, it too is dead, dead, dead...] Yahoo's subscription content federation is an example of the "index everything" solution on a much smaller scale, though this implementation is only partially effective and covers only small groups of content sources.
In theory, a web index such as Google.com is a federated index of sorts, culling together millions of small and large 'content sources' known as websites into a centralized search index. The goal is fundamentally no different from metasearching, but the architecture and context deliver a vastly different user experience and level of effectiveness.
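For contrast with federation, here is a toy version of the "index everything" idea in Python: metadata from every source goes into one inverted index, and one ranking function scores all of it. This is only a sketch under assumptions. The record fields, the `premium` flag, the sample records, and the count-the-matching-terms scoring are hypothetical; a real engine would use far richer relevance signals.

```python
from collections import defaultdict


def build_index(records):
    """Map each lowercase metadata term to the ids of the records that contain it."""
    index = defaultdict(set)
    for doc_id, rec in enumerate(records):
        text = " ".join([rec["title"], rec["author"], rec["abstract"]])
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, records, query):
    """Rank records from all sources together by how many query terms they match."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return [(records[d]["source"], records[d]["title"], records[d]["premium"]) for d in ranked]


# Hypothetical metadata records; only metadata is indexed, never the full text.
RECORDS = [
    {"source": "intranet", "title": "Federated search basics", "author": "A. Writer",
     "abstract": "An overview of metasearch", "premium": False},
    {"source": "newswire", "title": "Search market report", "author": "B. Analyst",
     "abstract": "Federated search vendors compared", "premium": True},
    {"source": "wiki", "title": "Clustering notes", "author": "C. Editor",
     "abstract": "Topic clustering pitfalls", "premium": False},
]

INDEX = build_index(RECORDS)
hits = search(INDEX, RECORDS, "federated search")
for source, title, premium in hits:
    print(source, title, "(pay for full text)" if premium else "")
```

Because every record is scored by the same function, the ordering problem from the federated case disappears: there is one result set, and the premium flag only decides what happens after the click.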
The bad news is also obvious. It is seemingly impossible to get all content sources opened up, and indexed anytime soon. Not to mention the privacy, copyright, formatting, and global policy issues that surround the notion. Just look at all the flak Google gets for scanning books in a library. Given this, might there be another way? Another approach that achieves maximum usability, and extracts maximum value from any cross section of content sources for the user, the researcher, the knowledge worker? I believe there is such a solution in development today. For hints as to the direction of such an approach, let me point you to a few successful 'mini-federators' in the web2.0 world that are really effective.
The current design of the 'dashboard' as we know it does not scale to support n number of content sources, and certainly not 700 or 1000, but CSE and Pipes are a very different story. They are essentially do-it-yourself federated indexes, as I described earlier. Very high potential, particularly as screen real estate is at an all-time premium today. As such, if we are to arrive at a true front-end solution versus a back-end (index everything) solution, it has to scale. It must also remain simple and efficient, and require an almost zero learning curve. New visual metaphors have only exacerbated the adoption and usability problems that plague most federation solutions in the market today.
If search federation sounds like a rather elusive problem to solve, I can promise you, elusive is an understatement. The answer lies in how we interact with, process, and digest information instinctively, not 'intuitively' as most info designers would have you believe. Intuition-driven approaches only lead to new products and solutions that chase their own tail, never really solving the problem at hand. We have seen, and will see yet more companies come and go with their valiant attempts to crack the code for federated search. But until the real problems with federation are truly understood, be prepared for more tail wagging. Ironically enough however, it appears to me a solution might soon be launched...right under our noses...hmmm. As always, stay tuned!
2 comments:
Lifehacker has a post about the new Alpha search from Yahoo. It's their take on bringing together search results from various sources.
From J.W. Lehman,
founder of Verity
Real Source Content / Result Federation is alive and well
“Old Federated searchers never die, they just become…..”
anon
1. The poster hasn’t a clue about the purpose of federated search in information retrieval / research. Should “federated search” take the blame for slow/poor collection access? Of course not. Federated search is NOT, as the poster claims, an “interactive” single collection search mechanism, ala google, verity or any like…it’s a “watcher-monitor” of what is going on in the info-world in specific subject areas. If the poster told his enterprise customers they were getting google-for-the-deep-web, the poster just didn’t understand their requirements….typical for IR technology vendors, and VCs. Who cares if the answer takes 5 minutes or 5 hours? The purpose of federated search is sending-alerting new relevant material as it’s generated. Federated search is a very powerful, and quick, research assistant WHEN IT IS APPLIED PROPERLY.
Federated search supports COMMUNITIES OF INTEREST by replacing the incredibly complex need to individually access and merge content from all appropriate sources in the search for answers (regardless of their “fun-ness” to access), with a process that does it on command.
If the user can’t wait 5 minutes or 5 whatevers for results that he/she couldn’t obtain in 5 weeks-months of manual effort, then the sources themselves must be unnecessary. The poster, and most of the rest of us, have fallen under the google-spell that time to first result and time-to-answer are the same. Not! How long does it take to find the fact/assumption/relationship in google/convera/verity/zylab/inxight result # 870? We’ll Never Find It, because we gave up after result 25.
2. “keyword” search? What century is the poster from? If you can’t explore content via explicit taxonomies with the searchrules to back them up, of course you’re going to get poor, mixed up results. [and not only is clustering dead, dead, dead…, it was never alive!]
Index everything!!!!!!!!! Why bother? Keyword search will give you the same mess on an indexed collection…actually worse, because it’s only the rare and to-date, unpopular engine that recognized the presence of evidence at the meaningful text unit (i.e. paragraph) level….so instead of federated search telling you your “KEY-WORD” is actually in the title/snippet/abstract, you now get to discover the 1000x list of content where it’s anywhere in the full-text. What an advancement!
3. Result Federation…..The ability to de-dupe, de-mystify and normalize results from multiple relevancy determination techniques has been available for years…where have you been? All that’s necessary is to make a practical relevance determination of each result based upon the search request; and order it.
4. In any subject, google-yahoo-ms-altavista-etc, lets you find out what everyone else already knows…..the ability to find out what nobody else knows/surmises is virtually denied. That is what federated search is for … multi-disciplined communities of interest seeking answers to advance knowledge, as opposed to wikipedias-google results.