Being Verity in a Google World
"Real Source Content / Result Federation is alive and well"
- J.W. Lehman, founder of Verity
I would be remiss in not taking the opportunity to respond to an interesting comment posted to my blog back in June on Federated Search. The response came from a founder of Verity, a leading enterprise search vendor acquired by Autonomy. The comments revitalize the debate surrounding the evolution of information retrieval and the evolution of information storage. To best clarify my position, and provide a rebuttal to the points raised by J.W., I'll provide my comments in-line and in bold following the poster's comments. The debate is on, right after the jump!
From J.W. Lehman,
founder of Verity
Real Source Content / Result Federation is alive and well
“Old Federated searchers never die, they just become…..”
anon
1. The poster hasn’t a clue about the purpose of federated search in information retrieval / research. Should “federated search” take the blame for slow/poor collection access? Of course not. Federated search is NOT, as the poster claims, an “interactive” single collection search mechanism, ala google, verity or any like…it’s a “watcher-monitor” of what is going on in the info-world in specific subject areas. If the poster told his enterprise customers they were getting google-for-the-deep-web, the poster just didn’t understand their requirements….typical for IR technology vendors, and VCs. Who cares if the answer takes 5 minutes or 5 hours? The purpose of federated search is sending-alerting new relevant material as it’s generated. Federated search is a very powerful, and quick, research assistant WHEN IT IS APPLIED PROPERLY.
Well, I think the opening paragraph pretty much says it all: "Who cares if the answer takes 5 minutes, or 5 hours?" Who indeed...hmmm. How about everyone. Everyone who's grown up with today's superior web search engines at their disposal. I would love to take a poll to see how many people would be willing to wait 5 minutes, much less 5 hours, for ANY form of search. But let us carry on.
We must first correct the poster's bold premise because federated search clearly does not belong in the 'watcher/monitor' category. Watcher/monitors have a very distinct membership that is quite different from federated search. RSS/Atom feed readers, dashboards, and RSS aggregators such as iGoogle, NetVibes, NewsGator, Bloglines, and OriginalSignal, are watcher/monitors. They are not federated search players whatsoever. Search is pull not push. It is active not passive. Feed aggregation is passive, and it pushes. Apples and oranges here.
Federated search supports COMMUNITIES OF INTEREST by replacing the incredibly complex need to individually access and merge content from all appropriate sources in the search for answers (regardless of their “fun-ness” to access), with a process that does it on command.
J.W. obviously hasn't read my other posting on Yahoo Pipes and Google CSE. I offer it up as prerequisite reading material before anyone claims that big search engines can't address 'communities of interest' in a far easier and more powerful way.
If the user can’t wait 5 minutes or 5 whatevers for results that he/she couldn’t obtain in 5 weeks-months of manual effort, then the sources themselves must be unnecessary.
This remark invokes the proverbial 'wake up and smell the coffee' response. Every search engine in existence has invested millions in R&D and usability studies, and they unanimously conclude that speed matters. And it doesn't just matter, it is vital to achieving widespread adoption and utility-- it is vital to survival. It is often the difference between #1 in the industry and #100.
See this link for empirical evidence to this point. You'll find that user adoption and user satisfaction are of paramount importance to the search experience. A 500ms increase in response time results in millions of abandoned searches and unsatisfied users. Can you imagine what would happen if these users had to wait 100 times as long? That would be 50 seconds. How about, as J.W. suggests, 1000 times as long? That is over 8 minutes. I think we can all predict the outcome.
The poster, and most of the rest of us, have fallen under the google-spell that time to first result and time-to-answer are the same. Not! How long does it take to find the fact/assumption/relationship in google/convera/verity/zylab/inxight result # 870? We’ll Never Find It, because we gave up after result 25.
This is no spell. This is reality. The world has evolved, people. The majority of web surfers are a few of us Gen-Xers plus Gen-Y, Gen-Z, and Millennials. Most were born into this world with a cell phone in hand, and broadband, and Wi-Fi everywhere. The expectation of always-on, instant gratification, and real-time computing convenience is not a nice-to-have in today's world; it is simply an assumed, necessary requirement. And they are the best and brightest generations of our time.
2. “keyword” search? What century is the poster from? If you can’t explore content via explicit taxonomies with the searchrules to back them up, of course you’re going to get poor, mixed up results. [and not only is clustering is dead, dead, dead…, it was never alive!]
We do agree on one point above-- clustering is not ready for prime time. Beyond that, perhaps our differences are simply generational. I am part of the Internet generation, and not a day earlier. Let's be real folks, keyword search works, it works really, really well. It is indisputably the fastest, most popular, and most effective universal mechanism for finding information today.
Today's keyword search engines are anything but just keywords. But my discussion is not (and has not ever been) about keyword searching. It is about federated search, and its shortcomings, and why we must index everything. But for the sake of discussion here's my quick take on the state of keyword search technology: Today's 'keyword search interpretation' technologies are more intelligent, proactive, interpretative, interpolative, and extrapolative than ever before. They are capable of much more than meets the eye. But that is the point: to keep it simple for the user, to appear as if the system is 'idiot proof' and that all it takes is a few simple keywords and magic happens. This is increasingly becoming the case today. More to do, this is certain. However, keyword search is still by far the most effective input mechanism for matching information with your intent, even if you aren't fully aware of your intent nor fully knowledgeable on the subject you pursue. See an upcoming post titled: "Browsing the Web for Knowledge Using Keyword Search."
The industry deadpool is full of vendors that once hawked taxonomies, directories, and other structured content browsers. Taxonomies are great for very specialized collections of content, but they totally implode when mashed together by a federated search engine and 10 other content sources with totally different ontologies, categories, and metadata. It just doesn't work when blended together from completely different sources.
Index everything!!!!!!!!! Why bother? Keyword search will give you the same mess on an indexed collection…actually worse, because it’s only the rare and to-date, unpopular engine that recognized the presence of evidence at the meaningful text unit (i.e. paragraph) level….so instead of federated search telling you your “KEY-WORD” is actually in the title/snippet/abstract, you now get to discover the 1000x list of content where it’s anywhere in the full-text. What an advancement!
Why bother, hmmm...why indeed... Well, let's see... the last time someone got the idea to do this the right way, out popped a couple of life changing web companies with worldwide adoption and sustained valuations in the tens and hundreds of billions of dollars.
But here's a better reason: It just plain works.
Every federated search engine, including Verity, when plugged into multiple sources for keyword searching does at least this much: pass the keyword queries to each content source wired to the federated search, and get results back from each, the keyword way. We know there are many other ways to retrieve content from a source, but this topic is and has always been about federated searching, not federated browsing, nor conceptual matching. All of which can still be done better with a single index of content anyway.
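For readers who want to see that "at least this much" in concrete terms, here is a minimal sketch of the fan-out step. The sources are hypothetical stand-ins with simulated delays (the names source_a, source_b, source_c and the helper make_source are made up for illustration), not any vendor's actual connectors. It only shows the shape of the thing: one keyword query goes out to every source, and the user waits until the slowest source answers.

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Hypothetical stand-in for a remote content source: takes a keyword query
# and returns (title, snippet, url) tuples after a simulated delay.
def make_source(name, delay_seconds):
    def search(query):
        time.sleep(delay_seconds)  # simulated network and query latency
        return [
            (f"{name} result {i} for '{query}'", "snippet...", f"http://{name}.example/{i}")
            for i in range(1, 4)
        ]
    return search

# Assumed sources and delays, chosen only for illustration.
SOURCES = {
    "source_a": make_source("source_a", 0.01),
    "source_b": make_source("source_b", 0.05),
    "source_c": make_source("source_c", 0.20),  # the slow one
}

def federated_search(query, sources):
    """Send the same keyword query to every source and collect each answer."""
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in sources.items()}
        # result() blocks, so the whole call takes as long as the slowest source
        return {name: fut.result() for name, fut in futures.items()}

if __name__ == "__main__":
    start = time.perf_counter()
    results = federated_search("nanotechnology fabrication", SOURCES)
    elapsed = time.perf_counter() - start
    for name, hits in results.items():
        print(name, len(hits), "hits")
    print(f"elapsed: {elapsed:.2f}s")
```

Even with the queries running in parallel, the elapsed time tracks the slowest source, which is the speed problem in miniature.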
3. Result Federation…..The ability to de-dupe, de-mystify and normalize results from multiple relevancy determination techniques has been available for years…where have you been? All that’s necessary is to make a practical relevance determination of each result based upon the search request; and order it.
Regarding the existence of de-duping, etc. I distinctly don't recall saying anything to the contrary. I merely support the fact that all implementations to date do not work very well. Not one federated search engine can possibly make a reliable relevance determination based on the search query for one simple reason: it is not up to the federated engine to decide! The results that come in from each disparate content source are determined by the ranking and relevancy engine of each source's proprietary algorithm. Thus, even if the federated engine could magically infer the inter-source ranking with some degree of usefulness (though doubtful), the net results would only be as good as the worst ranking algo from the worst content source. Let's look at a simple illustration to clarify, shall we:
Step 1: Example query: nanotechnology fabrication
Step 2: Sources 1-5 are selected to 'federate' - assume sources 3-5 have terrible ranking engines
Step 3: The above keywords (yes keywords J.W.) are passed to each source's query engine
Step 4: The "top ten" results are returned from each source's relevancy engine
Step 5: The 50 results are somehow re-ranked based on the nature of the query? I'd like to see that. Especially since the results returned are merely title, snippet, and URL, NOT full-text, as is the case with every standard enterprise and web search engine index.
Step 6: Regardless, the poorly ranked documents from sources 3-5 make it impossible to unify the ranking in anything but a largely arbitrary way, which gives the results list only arbitrary credibility.
Step 7: Because the federation technology has no way to evaluate how well a given source is ranking its own documents, it is impossible to establish a consistently high quality set of ordered results, using this antiquated yet widely suggested way of federating.
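The steps above can be sketched in a few lines. This is a toy illustration under loud assumptions: the five sources, their documents, and the relevance scores are all made up, and round-robin interleaving is just one simple merge strategy I picked for the example, not what any particular federated product does. The true_relevance value is a ground truth the federator never gets to see; it is there only so we can count the damage afterwards.

```python
from itertools import zip_longest

# Each hypothetical source returns its own top ten as (title, true_relevance).
# true_relevance is invented ground truth, invisible to the federator.
def good_source(name):
    # a decent ranking engine: best documents first
    return [(f"{name}-doc{i}", 1.0 - 0.1 * i) for i in range(10)]

def bad_source(name):
    # a terrible ranking engine: worst documents first
    return [(f"{name}-doc{i}", 0.1 * i) for i in range(10)]

def round_robin_merge(result_lists):
    """Interleave per-source lists by rank position: all #1s, then all #2s, ..."""
    merged = []
    for tier in zip_longest(*result_lists):
        merged.extend(hit for hit in tier if hit is not None)
    return merged

# Sources 1-2 rank well, sources 3-5 rank terribly (as in Step 2).
sources = [
    good_source("s1"), good_source("s2"),
    bad_source("s3"), bad_source("s4"), bad_source("s5"),
]

merged = round_robin_merge(sources)
top10 = merged[:10]
low_quality_in_top10 = sum(1 for _, relevance in top10 if relevance < 0.5)

if __name__ == "__main__":
    for title, relevance in top10:
        print(f"{title:10s} true relevance {relevance:.1f}")
    print("low-quality results in the top 10:", low_quality_in_top10)
```

With only titles, snippets, and URLs to go on, the federator has no basis for pushing the junk from sources 3-5 down the list, so it lands on the first page alongside the good results.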
4. In any subject, google-yahoo-ms-altavista-etc, lets you find out what everyone else already knows…..the ability to find out what nobody else knows/surmises is virtually denied.
This belief makes one heck of a gross assumption as to the way in which any of the aforementioned engines employ page ranking. Discovery is purely a function of the nature of the access methods to the information source, all other things being equal. With a single index of content I can create discovery, knowledge, connectedness, and relatedness of concepts, sentences, subjects, and more without the need for federating a single thing. It was called Grokker 2.3 Desktop for Google, back in 2004. Today it's called Google CSE for a single source, and for multiple sources it's called Yahoo Pipes.
That is what federated search is for … multi-disciplined communities of interest seeking answers to advance knowledge, as opposed to wikipedias-google results.
June 11, 2007 3:35 PM
Federated search as it exists today is not a social medium, and it was never intended to be. Collaborative filtering, collective intelligence on the other hand, is the future today. Has someone slept through the web2.0 phenom? digg, delicious, feedburner, flickr, Wize, Yelp, Google Reader, iGoogle. Web 2.0 companies have already categorically taken this aging notion of 'communities of interest' via metasearch tools and turned it upside down-- and actually made it work for the first time. And while all of these new web services aggregate content from a huge multiple of sources, they are not federated search engines in any sense of the word, as I have described in all of my postings.
What's more, limiting the definition of federated search to research/enterprise content, as opposed to searching public WWW content, is a significant mistake.
For if the best of today's web search engines were to index ALL of the available high quality, structured enterprise/research content behind the firewall (which now a few of them are doing, btw), I could then profess the end of old-school federated search, that has plagued enterprises, universities, and the world at large for over a decade now. Giving way to entirely new ways of federating, classifying, categorizing content-- but from a universal index of content with standardized metadata and shared ranking algorithms.
So my position remains unchanged, if not reinforced. The doctor has checked the patient for a pulse, and she's still dead as a doornail. Good night and good bye my dear federator...