Sunday, November 11, 2007

And...We're Back!

This is what happens when you take your eye off the Netsol ball. Sixorg mysteriously disappeared from the web, and rather abruptly, when there was no gas left in the tank over at the world's most expensive domain registrar. I'm thinking of switching my portfolio of domains over to a cheaper service soon. Anyway, it's good to be back online and in the flow...I've got a few posts in the pipeline now, covering some very clever search technologies that are worth a closer look. Stay tuned for those later this week and next.

In the meantime, I've solved your Thanksgiving holiday hassle! This is all you need to impress your family and the parental units... Check it out during the commercial break, while I get my site back in full working order. More to come this holiday season! ;)

Monday, August 20, 2007

Being Verity in a Google World

"Real Source Content / Result Federation is alive and well"
- J.W. Lehman, founder of Verity


I would be remiss in not taking the opportunity to respond to an interesting comment posted to my blog back in June on Federated Search. The response came from a founder of Verity, a leading enterprise search vendor acquired by Autonomy. The comments revitalize the debate surrounding the evolution of information retrieval and the evolution of information storage. To best clarify my position, and provide a rebuttal to the points raised by J.W., I'll provide my comments in-line and in bold following the poster's comments. The debate is on, right after the jump!

From J.W. Lehman,
founder of Verity

Real Source Content / Result Federation is alive and well

“Old Federated searchers never die, they just become…..”
anon

1. The poster hasn’t a clue about the purpose of federated search in information retrieval / research. Should “federated search” take the blame for slow/poor collection access? Of course not. Federated search is NOT, as the poster claims, an “interactive” single collection search mechanism, ala google, verity or any like…it’s a “watcher-monitor” of what is going on in the info-world in specific subject areas. If the poster told his enterprise customers they were getting google-for-the-deep-web, the poster just didn’t understand their requirements….typical for IR technology vendors, and VCs. Who cares if the answer takes 5 minutes or 5 hours? The purpose of federated search is sending-alerting new relevant material as it’s generated. Federated search is a very powerful, and quick, research assistant WHEN IT IS APPLIED PROPERLY.


Well, I think the opening paragraph pretty much says it all: "Who cares if the answer takes 5 minutes, or 5 hours?" Who indeed...hmmm. How about everyone. Everyone who's grown up with today's superior web search engines at their disposal. I would love to take a poll to see how many people would be willing to wait 5 minutes much less 5 hours for ANY form of search. But let us carry on.

We must first correct the poster's bold premise because federated search clearly does not belong in the 'watcher/monitor' category. Watcher/monitors have a very distinct membership that is quite different from federated search. RSS/Atom feed readers, dashboards, and RSS aggregators such as iGoogle, NetVibes, NewsGator, Bloglines, and OriginalSignal, are watcher/monitors. They are not federated search players whatsoever. Search is pull not push. It is active not passive. Feed aggregation is passive, and it pushes. Apples and oranges here.


Federated search supports COMMUNITIES OF INTEREST by replacing the incredibly complex need to individually access and merge content from all appropriate sources in the search for answers (regardless of their “fun-ness” to access), with a process that does it on command.

J.W. obviously hasn't read my other posting on Yahoo Pipes, and Google CSE. I offer up this as prerequisite reading material before claiming that big search engines can't address 'communities of interest' in a far easier and more powerful way.

If the user can’t wait 5 minutes or 5 whatevers for results that he/she couldn’t obtain in 5 weeks-months of manual effort, then the sources themselves must be unnecessary.

This remark invokes the proverbial 'wake up and smell the coffee' response. Every search engine in existence has invested millions in R & D and usability studies to unanimously confirm and conclude that speed matters. And it doesn't just matter, it is vital to achieving widespread adoption and utility-- it is vital to survival. It is often the difference between #1 in the industry and #100.

See this link for empirical evidence to this point. You'll find that user adoption and user satisfaction are of paramount importance to the search experience. A 500ms increase in response time results in millions of abandoned searches and unsatisfied users. Can you imagine what would happen if these users had to wait 100 times as long? That would be 50 seconds. How about, as J.W. suggests, 1000 times as long? That would be more than 8 minutes. I think we can all predict the outcome.


The poster, and most of the rest of us, have fallen under the google-spell that time to first result and time-to-answer are the same. Not! How long does it take to find the fact/assumption/relationship in google/convera/verity/zylab/inxight result # 870? We’ll Never Find It, because we gave up after result 25.

This is no spell. This is reality. The world has evolved, people. The majority of web surfers are a few of us Gen-x'ers, Gen-Y, Z, Millennials. Most were born into this world with a cell phone in hand, and broadband, and Wi-Fi everywhere. The expectation of always on, instant gratification, and real-time computing convenience is not a nice-to-have in today's world, it is now merely an assumed, necessary requirement. And, they are the best and brightest generations of our time.


2. “keyword” search? What century is the poster from? If you can’t explore content via explicit taxonomies with the searchrules to back them up, of course you’re going to get poor, mixed up results. [and not only is clustering is dead, dead, dead…, it was never alive!]

We do agree on one point above-- clustering is not ready for prime time. Beyond that, perhaps our differences are simply generational. I am part of the Internet generation, and not a day earlier. Let's be real folks, keyword search works, it works really, really well. It is undisputedly the fastest, most popular, and most effective universal mechanism for finding information today.

Today's keyword search engines are about anything but just keywords. But my discussion is not (and has not ever been) about keyword searching. It is about federated search, and its shortcomings, and why we must index everything. But for the sake of discussion here's my quick take on the state of keyword search technology: Today's 'keyword search interpretation' technologies are more intelligent, proactive, interpretative, interpolative, and extrapolative than ever before. They are capable of much more than meets the eye. But that is the point, to keep it simple to the user, to appear as if the system is 'idiot proof' and that all it takes are a few simple keywords and magic happens. This is increasingly becoming the case today. More to do, this is certain. However, keyword search is still by far the most effective input mechanism for matching information with your intent, even if you aren't fully aware of your intent nor fully knowledgeable on the subject you pursue. See an upcoming post titled: "Browsing the Web for Knowledge Using Keyword Search."

The industry deadpool is full of vendors that once hawked taxonomies, directories, and other structured content browsers. Taxonomies are great for very specialized collections of content, but they totally implode when mashed together by a federated search engine and 10 other content sources with totally different ontologies, categories, and metadata. It just doesn't work when blended together from completely different sources.


Index everything!!!!!!!!! Why bother? Keyword search will give you the same mess on an indexed collection…actually worse, because it’s only the rare and to-date, unpopular engine that recognized the presence of evidence at the meaningful text unit (i.e. paragraph) level….so instead of federated search telling you your “KEY-WORD” is actually in the title/snippet/abstract, you now get to discover the 1000x list of content where it’s anywhere in the full-text. What an advancement!

Why bother, hmmm...why indeed... Well, let's see... the last time someone got the idea to do this the right way, out popped a couple of life changing web companies with worldwide adoption and sustained valuations in the tens and hundreds of billions of dollars.

But here's a better reason: It just plain works.

The real problem here is that my counterpart is mixing metaphors for comparison's sake by effectively equating federated search with concept search, and earlier with watcher/monitors, both of which are false equations. I'm not comparing methods of retrieval. I am focused on the virtues of storing all content in a single index. And just because we've indexed everything into a single source does not mean that we are limited to mere keyword searching for information retrieval.

Every federated search engine, including Verity, when plugged into multiple sources for keyword searching does at least this much: pass the keyword queries to each content source wired to the federated search, and get results back from each, the keyword way. We know there are many other ways to retrieve content from a source, but this topic is and has always been about federated searching, not federated browsing, nor conceptual matching. All of which can still be done better with a single index of content anyway.


3. Result Federation…..The ability to de-dupe, de-mystify and normalize results from multiple relevancy determination techniques has been available for years…where have you been? All that’s necessary is to make a practical relevance determination of each result based upon the search request; and order it.

Regarding the existence of de-duping, etc. I distinctly don't recall saying anything to the contrary. I merely support the fact that all implementations to date do not work very well. Not one federated search engine can possibly make a reliable relevance determination based on the search query for one simple reason: it is not up to the federated engine to decide! The results that come in from each disparate content source are determined by the ranking and relevancy engine of each source's proprietary algorithm. Thus, even if the federated engine could magically infer the inter-source ranking with some degree of usefulness (though doubtful), the net results would only be as good as the worst ranking algo from the worst content source. Let's look at a simple illustration to clarify, shall we:

Step 1: Example query: nanotechnology fabrication

Step 2: Sources 1-5 are selected to 'federate' - assume sources 3-5 have terrible ranking engines

Step 3: The above keywords (yes keywords J.W.) are passed to each source's query engine

Step 4: The "top ten" results are returned from each source's relevancy engine

Step 5: The 50 results are somehow re-ranked based on the nature of the query? I'd like to see that. Especially since the results returned are merely title, snippet, URL, and NOT full-text. As is the case with every standard enterprise and web search engine index.

Step 6: Regardless, the poorly ranked documents from sources 3-5 make it impossible to unify the ranking in anything but a largely arbitrary way, lending only arbitrary credibility to the results list.

Step 7: Because the federation technology has no way to evaluate how well a given source is ranking its own documents, it is impossible to establish a consistently high quality set of ordered results, using this antiquated yet widely suggested way of federating.
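
To make the steps above concrete, here is a minimal Python sketch of the merge problem. Everything in it is hypothetical: the source names, the titles, and the scores are made up for illustration, and neither merge function is taken from any real federated search product. The point is only that each source reports scores on its own scale, so any cross-source ordering is a guess.

```python
from itertools import zip_longest

# Hypothetical per-source result lists: (title, source_local_score).
# Each score is on that source's own scale and is NOT comparable to the others.
source_results = {
    "source1": [("A1", 0.92), ("A2", 0.85)],
    "source2": [("B1", 14.0), ("B2", 9.5)],
    "source3": [("C1", 870.0), ("C2", 310.0)],  # assume a poor ranking engine
}

def merge_by_raw_score(results):
    """Naive merge: sort everything by whatever score each source reported."""
    merged = [(score, title)
              for hits in results.values()
              for title, score in hits]
    return [title for score, title in sorted(merged, reverse=True)]

def merge_round_robin(results):
    """Interleave rank 1 from every source, then rank 2, and so on."""
    interleaved = []
    for tier in zip_longest(*results.values()):
        interleaved.extend(hit[0] for hit in tier if hit is not None)
    return interleaved

print(merge_by_raw_score(source_results))  # ['C1', 'C2', 'B1', 'B2', 'A1', 'A2']
print(merge_round_robin(source_results))   # ['A1', 'B1', 'C1', 'A2', 'B2', 'C2']
```

In the first merge the source with the biggest numbers wins outright, even though we assumed it has the worst ranking engine. In the second, every source gets equal billing no matter how good or bad it is. Neither ordering says anything about which document best answers the query.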


4. In any subject, google-yahoo-ms-altavista-etc, lets you find out what everyone
else already knows…..the ability to find out what nobody else knows/surmises is
virtually denied.

This belief makes one heck of a gross assumption as to the way in which any of the aforementioned engines employ page ranking. Discovery is purely a function of the nature of the access methods to the information source, all other things being equal. With a single index of content I can create discovery, knowledge, connectedness, and relatedness of concepts, sentences, subjects, and more without the need for federating a single thing. It was called Grokker 2.3 Desktop for Google, back in 2004. Today it's called Google CSE for a single source, and for multiple sources it's called Yahoo Pipes.


That is what federated search is for … multi-disciplined
communities of interest seeking answers to advance knowledge, as opposed to
wikipedias-google results.

June 11, 2007 3:35 PM


Federated search as it exists today is not a social medium, and it was never intended to be. Collaborative filtering, collective intelligence on the other hand, is the future today. Has someone slept through the web2.0 phenom? digg, delicious, feedburner, flickr, Wize, Yelp, Google Reader, iGoogle. Web 2.0 companies have already categorically taken this aging notion of 'communities of interest' via metasearch tools and turned it upside down-- and actually made it work for the first time. And while all of these new web services aggregate content from a huge multiple of sources, they are not federated search engines in any sense of the word, as I have described in all of my postings.

What's more, equating or limiting the definition of federated search to apply only to research/enterprise content versus searching public WWW content, is a significant misnomer.

For if the best of today's web search engines were to index ALL of the available high quality, structured enterprise/research content behind the firewall (which now a few of them are doing, btw), I could then profess the end of old-school federated search, that has plagued enterprises, universities, and the world at large for over a decade now. Giving way to entirely new ways of federating, classifying, categorizing content-- but from a universal index of content with standardized metadata and shared ranking algorithms.

So my position remains unchanged, if not reinforced. The doctor has checked the patient for a pulse, and she's still dead as a doornail. Good night and good bye my dear federator...


Friday, July 6, 2007

Powerset's Q & A vs. Keyword Search

Powerset would likely be the first to promote Natural Language Processing, NLP, as the future of search. Their recent blog post provokes a few interesting debates about the premise of their approach to improving Web search as we know it today. In theory, natural language processing is a very attractive method of human-computer interaction. In practice, it still has its limitations.

English is particularly challenging in this regard because it has little inflectional morphology to distinguish between parts of speech. Wikipedia has a simple little example to illustrate this point:

English and several other languages don't specify which word an adjective applies to. For example, in the string "pretty little girls' school".
  • Does the school look little?
  • Do the girls look little?
  • Do the girls look pretty?
  • Does the school look pretty?
This language code is very tricky to decipher with the high degree of accuracy and consistency necessary to provide an acceptable user experience. The full story after the jump...

Powerset will attempt to solve this problem with NLP and the creation of what must be an insanely massive library of ontologies in attempts to contextualize all the Web. A bold undertaking indeed. But let's set aside their pending solution and look at the potential impact to the user experience an NLP-based system would introduce. NLP works best with well formed questions, phrases, and 'contextual' descriptions. You'd be hard pressed to find NLP making improvements to results returned for some types of typical queries such as: "weather 94107" or "paris hilton" or "the police concert tour dates"

So the question becomes this: What percentage of all Web searches would truly benefit from NLP style queries? Is it enough to make it universal or stand on its own? Or is it better served as an enhancement or feature add-on to existing web search offerings? Me thinks it is the latter. Feature, product, business. Remember the FPB test. All technologies and ideas fall into one of the three.

NLP prefers the user to formulate semi-structured sentences to produce the best or most noticeably improved results when compared to traditional keyword searches. As stated above, this can be very handy for certain types of searches, without question. But what happens if your sentence is poorly written? What if your English, French, or Spanish language skills are not up to par? What if you are unfamiliar with the host's ontologies and vocabularies for a new research topic you want to explore? Can NLP produce better results in the absence of accurate or sufficient natural language input? And what of the content being retrieved? What if it too is miscategorized, or poorly structured text?

A common solution is: categorization, classification, and taxonomic organization of content. Another is to predetermine a vocabulary for a given topic of information. Ontologies as they are better or lesser known, for any genre of information, be it politics, sports, or nanotechnology are thereby subject to the vast interpretation of the authors that create them. These authors assign meaning in ways that could be interpreted much differently from how other people, cultures, and languages understand them to be. This could create incongruence between the question and the answer, er...between the query and the results.

Another interesting data point to bear in mind: Web searchers today are actually quite efficient and effective with keyword searching, enhanced further by increasing fluency with boolean and other advanced search operators. As such, keyword searching is often (but not always) hyper-efficient at getting the user precisely what they are looking for. Let us also remember that "keyword search" per se, doesn't necessarily equate to "keyword matching" as the sole or even primary means by which related content is returned from a traditional Web search index. Today's top search engine algorithms are far more complex than simple keyword matching, counting, and/or extraction. In fact, some components of page ranking, relevance, and ordering of results pages are language/text independent. Rather, they rely on the organic substructure of the Web, and the interconnections between information that help to paint the picture of related or important subject matter. This helps tremendously in dealing with the Wild, Wild, Web that is fraught with unstructured text, errors in spelling, inconsistent or incomplete grammar and the like found in millions of web pages around the world.
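
For the curious, here is a rough Python sketch of what "language/text independent" ranking can look like. It uses PageRank-style power iteration as a well-known example of link-based scoring; I am not claiming this is what any particular engine runs in production. The four-page toy web, the damping factor of 0.85, and the iteration count are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of page -> list of outbound links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page web; no page text is consulted at all.
toy_web = {
    "hub": ["a", "b", "c"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub", "a"],
}
scores = pagerank(toy_web)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

Notice that the function never looks at a single word on any page. Misspellings, bad grammar, and unstructured text simply don't enter into this part of the score.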

A lot of people in the industry like to assert that "not much has changed with web search over the past several years" which couldn't be further from the truth. The major search engines are enhancing their core search algo's multiple times per week in fact. The problem is that they (non-search experts, journalists, analysts) base their assessments on what they read or don't read about search in the press. Alternatively, they (new search upstarts and old dying breeds in search and enterprise search) are simply in denial, and keep telling themselves that search hasn't changed to help justify a withering existence.

But do not fret (too much anyway) all is not lost. I do believe there are a few definitive paths to success in the web search industry for new companies with the right idea-- but only those that come prepared with their eyes wide open, and a very realistic view of where search truly is today, and where the world of end-users is influencing it from here. Without an accurate view, let's be honest, they're pretty much dead.

As for Powerset, I have to believe they've embraced this exercise, but only time will tell. Let's see how they debut later this year.


Thursday, June 7, 2007

iSearch, uSearch

Just returned from a month in Asia and Australia (inset pix of Shanghai nightlife). Fascinating centers of innovation from Tokyo to Singapore to Sydney. All have some interesting twists to next-gen Web applications and search. But that's for another time, another post. Today it's time we revisit the death of federated search (aka metasearch, single search, etc.) as we know it, and share a glimpse of what the future holds for finally solving the very elusive problem of getting at all of our information as easily as we should be able to. Ok, just a moment...ok, yep just checked, and it's still dead, dead as a doornail.

My friends and colleagues at Stanford University library and info-sciences department have been researching this problem head-on with over 700 databases of searchable research and academic content at their disposal. They are not alone. Countless universities, companies, and web services at large have found themselves at the end of the same dead end road.

Single search doesn't work, nor does traditional metasearch, or any other twists on federated search. Clustering metasearched results from multiple sources into artificial categories or groups only exacerbates the problem, and thoroughly confuses the end user. (sidebar: clustering has a very long way to go before it is anywhere near ready for prime time public consumption. Until then, it has no business being in search) These approaches have, in fact, proven pointless and only further delay any attempts to arrive at an acceptable user experience for effectively accessing a multitude of content sources simultaneously. I would go as far as saying that these so called solutions are robbing these important customers of their youth. Costing not just hundreds of thousands in license fees, but years of setbacks and distractions dealing with totally ineffective solutions. Everybody seems to have an angle that to me is nothing short of amusing. You'll notice through that link several spins on the same broken solution. I reviewed everything listed in those results. RIP.

There is a reason why basic search remains so widely popular, effective, and accepted by the vast majority of info seekers. Because it works. Because it is simple and intuitive. People get it. What people don't get are kludgey attempts to mash a bunch of square pegs into a round hole. If you look at the quality of search results from any of the tens or hundreds of enterprise search vendors, metasearch peddlers, and then say, Google, what you'll find might surprise you. Or maybe it won't. Yes, obviously Google.com works, and Google Search Appliance is no different. GSA stuck to its roots from Google.com for a reason: simple and intuitive user experience and high quality results-- from ONE source. Today GSA can crawl and index virtually any type of info object or database in existence. Why bother promoting new content in separate databases? This only adds to the problem. And with Google OneBox, we go even further, wiring competing content management systems to a better Google-controlled search experience.

So just what am I getting at? No, I'm not pimping Google's 'wares, but I am using them as one of only a few early examples of how to correctly begin to approach this problem. The answer is simple. One source. One index. One search interface. The fact that 700 databases sit in front of the info seeker is the real problem. There is no cohesive data model to support any meaningful metasearch whatsoever. "Normalizing" the boolean structure of the query language for each source's retrieval method was thought to 'standardize' the results that come back from all these random content sources. Not so. For it is not the query that matters, rather it is how the content is indexed. Just because the genre or subject nature of two content databases appears to be 'related' does not imply that the returned results will be the best combination of the two sources. Why? Because they have completely independent relational structures, metadata schemas, and ontologies.

Federated search, as we knew it before it died, did nothing more than mask this problem with a bland search interface wrapped around a broken and discontinuous distributed data model. Despite the cold reality, many of you still employ this type of solution at an increasingly expensive cost to your company and to your users' productivity.

But let's get back to the answer. Google introduced Universal Search, after quietly testing the concept under an alias website: searchmash.com. Yep, they really do. Universal Search is not there yet, but it is a move in the right direction. Yes, even Google faced a minor federation/metasearch problem as they continued to grow laterally into new content categories, e.g. News, Photos, Videos, Blogs, Products, Scholar, etc... As a result, it became increasingly unclear whether Google.com was the right place to start a search with so many alternate entry points that may be more appropriate for certain searches, e.g.: blogsearch.google.com, or news.google.com, and many more.

Universal Search is an early attempt to give the user a little taste of everything: pictures, videos, blogs, news, and web search results in one result page. Check out this basic example here for Steve Jobs. You get what I'm saying. Now, this doesn't exactly scale if you have 20, 30, or 700 types of content, or content sources to display on a page. They simply wouldn't fit. Additionally, Universal Search is more about displaying content of different types or formats versus merely different sources of content. For example, web pages, news articles, pictures, and videos are all very different types of content. I have designed two unique ways to address this problem, following some of the principles of Universal Search. Enter Integrated Search.

The integration of content sources is where we begin. The devil is most certainly in the details for this design and implementation, but here is the gist:

Recipe for Integrated Search

Ingredients

n parts of unique content sources
1 part really nice crawler/indexer (Nutch, GSA, or Lucene)
1 part high quality query interface with boolean translators, NLP, and auto completion and suggestion. (See CiteSeer or ACM for several)

Frappé all ingredients until smooth. Let stand and cool for 10 minutes.
Season to taste with one or both of the following:

1 search index inverter (yes, the secret sauce)
A dash of user intent interpolation at the point of query


This solves 3 problems at once. A single index, so that no sources need be considered at query time, ever. Smart pre-query processing to help guide the search query to match the users' intent. (We'll discuss intent-driven searches, or lack thereof, in an upcoming post.) And a powerful index/ranker to ensure that every content object in the index, from every original source is uniformly considered when ordering and displaying the results that best match the query.
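
Here is a small Python sketch of the single-index idea, minus the secret sauce. It is not how Nutch, GSA, or Lucene work internally; the three documents, the source names, and the simple TF-IDF formula are all assumptions made up for illustration. What it does show is that once everything lives in one index, the original source stops mattering at query time and every document is scored by the same ranker.

```python
import math
from collections import Counter, defaultdict

# Hypothetical documents crawled from three different original sources.
documents = {
    ("library_db", 1): "nanotechnology fabrication methods for thin films",
    ("patent_db", 7): "fabrication of steel beams",
    ("news_feed", 3): "nanotechnology funding news",
}

def build_index(docs):
    """Build ONE inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(text.lower().split()).items():
            index[term][doc_id] = count
    return index

def search(index, total_docs, query):
    """Score every matching document with the same TF-IDF formula."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + total_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores, key=scores.get, reverse=True)

index = build_index(documents)
ranked = search(index, len(documents), "nanotechnology fabrication")
print(ranked[0])  # ('library_db', 1)
```

The library document wins because it matches both query terms under one shared scoring rule, not because its source happened to report a bigger number.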

This is NOT the case with traditional federators, which do nothing more than combine search results from hundreds of different indexing methodologies, with absolutely no way to 'honestly' or intelligently rank and order results that come from different indexers and ranking algos.

So even without revealing the secret sauce, you can see how this approach is fast, simple, and aligned with traditional search user experiences. The hard part? Crawling all the content sources means writing system adapters to connect to the weirdest of old school flat file DB's, obscure object databases, and a whole lot worse. But if you pick a good crawler or general search product, much of that hacking has been done for you, as with Google's Search Appliance and their 220+ adapters that work pretty well out of the Box, pun intended.

So about that secret sauce? Well with a good inference about the user's intent we can bias the search results to better cater to the user's objective. And as for index inverting, it's really about inverting the results that come from the index, for a given query. Ever curious what results actually appear at the end of a big web search with 5,400,000 results? How about dead middle of those 5.4 mil? Curious aren't we? Yes, it's all about discovery, and those deeper results can be more useful than you might think.
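
I'm not giving away the recipe, but to illustrate the curiosity above, here is a toy Python sketch of peeking at the top, the dead middle, and the very end of a ranked list. Treat it as one possible reading of the idea only. The function name, the window size, and the 100 fake results are all hypothetical.

```python
def sample_ranked_results(ranked, window=3):
    """Return small windows from the top, middle, and bottom of a ranked list."""
    mid_start = max(0, len(ranked) // 2 - window // 2)
    return {
        "top": ranked[:window],
        "middle": ranked[mid_start:mid_start + window],
        "bottom": ranked[-window:],
    }

# Hypothetical ranked list standing in for a big result set.
ranked = [f"result {i}" for i in range(1, 101)]
sample = sample_ranked_results(ranked)
print(sample["middle"])  # ['result 50', 'result 51', 'result 52']
print(sample["bottom"])  # ['result 98', 'result 99', 'result 100']
```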

As screen real estate continues to increase on the desktop/laptop, we'll no doubt continue to see search results get 'fatter' as in wider across the page. Yes, two and three column search results are on the way. And wait till you see where the ads turn up. For search it's just the beginning. For federated search, well maybe we'll call it a new beginning. But for them, this means starting over. Completely.

So far I've yet to see any legitimate newcomers enter the arena to take up this challenge/opportunity head-on. In the meantime, partial solutions are manifesting within Web search while Google, Yahoo, and Ask continue to advance some good ideas in this arena. Yes, even Ask has been doing 'Unified' Search on their home page for a while now, and it's actually a reasonably clean UI...try out this query: iPhone be sure to stretch your browser as wide as it will go...not bad.

Integrated Search, iSearch. Coming to a theater near you? We'll soon find out...


Friday, April 27, 2007

The Graphical Information Interface - Made Simple

We pioneered them, and coined the 'GII' at Groxis back in 2001. The significance of new interfaces to information transcends that of any one company and any one particular problem to be solved. The sheer mass of information that we now process on any given day is orders of magnitude greater than ever before. As such, our means by which to effectively absorb, leverage, and filter information flows (or floods) must also evolve. Enter the GII. We created Grokker with the idea that information needed to be liberated from the confines of HTML and the web browser as we knew it years ago. It was our "1.0" attempt to create a universal information currency with the capacity to convey far more usefulness than a list of 10 search results on a web page. Grokker sparked a small but potent movement along these lines, paving the way for new approaches to opening the information bottleneck at the point of consumption. Many great advances have emerged, showing promise for the future of information experiences to come.

Sixorg is all about information, search, and the like. As such, you can expect to find many cool, new examples of the GII that fit the bill in future posts. Today I'm sharing a 'blog find' that ranks high on the list for one of my good friends, and is now high on my list. It's called Indexed by Jessica Hagy. In a word, brilliant. A witty and creative smattering of hand drawn Venn diagrams, scatter graphs, and more, representing mathematical theorem proofs of everyday life- infoviz style. The clever and entertaining diagrams and graphs convey information in spades- further proving another theorem that a picture is worth a thousand words. Indexed is a simple but great example of GII's in action. There is no reason to explain it further, because the examples so clearly speak for themselves. 'Nuff said.

Take a look, bookmark Indexed. It's fresh air for your creative mind. Thanks to my friend and colleague Rosie for the link. Rosie doesn't have a blog, but should. Girl's got skills. Maybe this will kick her in gear to get her game on! More to come. Stay tuned...


Tuesday, April 3, 2007

Personalized Google Mashups - On The Fly

If you haven't used JSON, you're missing out. If you haven't heard of it, you're just out of it, period. JSON is a great data interchange format that Google utilizes to streamline their first mashup wizard for Google Maps. It's a simple alternative to coding (certain) server-side proxies for HTTP requests to get to data in the form of JSON feeds. JSON liberated this extremely cool mashup wizard at Google a few days ago. Zero coding required to build very useful Google maps mashups of your own from your own Google Spreadsheet table. Reminds me of XQuery's thin client-side data extraction properties. Not surprising. Hmmm...XQuery for JSON...we could really be on to something. At any rate, for this example, you have to get your data into Google's Spreadsheet first, but that's far simpler than coding a mashup from scratch. This is the power of great front-side middleware, making custom app building truly user friendly. An excellent step forward that will no doubt unleash a new bevy of corporate, personal, and startup mashups. My first mashup to follow...
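
If you've never seen JSON in action, here is a quick Python sketch of why it makes this kind of thing so painless. The feed below is made up: the "rows", "name", "lat", and "lng" field names are my assumptions for illustration and are not Google's actual spreadsheet feed schema. The idea is simply that a few lines turn a JSON feed of spreadsheet rows into map markers.

```python
import json

# Hypothetical JSON feed of spreadsheet rows; the field names are assumptions,
# not Google's actual feed schema.
feed_text = """
{"rows": [
  {"name": "Office", "lat": "37.7793", "lng": "-122.4193"},
  {"name": "Cafe", "lat": "37.7955", "lng": "-122.3937"}
]}
"""

def rows_to_markers(text):
    """Parse the feed and convert each row into a (name, lat, lng) marker."""
    data = json.loads(text)
    return [(row["name"], float(row["lat"]), float(row["lng"]))
            for row in data["rows"]]

markers = rows_to_markers(feed_text)
for name, lat, lng in markers:
    print(name, lat, lng)
```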


Federated Search is dead, dead, dead...

I've been asked to write about this for some time, and that time has finally come. O'Reilly touched on the subject recently discussing Google's plans in this arena. But white papers on the subject are not required to explain why traditional approaches to this dilemma are toast. In this post I'll explain why federation is broken and how corporations, universities, and start-ups continue to throw $ at the wrong end of the problem...click below to dive in!

Federation defined
Federated search is the art of attempting to execute a single keyword search across n number of databases, content sources, indexes, news feeds, etc. This is also known as metasearch, deep-web search, and content aggregation. No central federated index is maintained, and no crawling or spidering is required. The idea, in theory anyway, is certainly a convenient one. At Groxis, 90% of our customers were most interested in federating their enterprise content sources. In large companies and universities alike, the sea of available content silos for any given organization is vast. It is not uncommon to find hundreds, even thousands, of content sources used across a single organization.

Federation is all about passing the user's search query separately to each of those search engines, collecting x number of search results from each source, and then figuring out how to display them to the user in some meaningful, useful, actionable way. This display challenge arises because many of the content sources typically cannot be crawled or spidered, whether due to licensing issues, ownership of the content, or because the content resides in a data store that is not crawlable. Examples: SQL databases, proprietary content management systems, and commercial content such as Lexis, Factiva, Reuters, etc. Federated search is a quick and dirty way to scan across a vast array of content sources.
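
The fan-out described above can be sketched in a few lines of Python. This is only an illustration: the two source functions are hypothetical stand-ins for real search engines, and `federate` and `per_source_limit` are names I made up for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for real content sources. Each takes a query and
# returns its own ranked list of (title, source_specific_score) pairs.
def search_news(query):
    return [(f"News story about {query}", 0.91), (f"Older {query} story", 0.40)]

def search_database(query):
    return [(f"Database record for {query}", 17.0)]

SOURCES = {"news": search_news, "database": search_database}

def federate(query, sources=SOURCES, per_source_limit=10):
    """Send the same query to every source and keep the top results from each."""
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in sources.items()}
        return {name: fut.result()[:per_source_limit] for name, fut in futures.items()}

results = federate("solar power")
```

Notice what comes back: one separate, separately scored list per source. Collecting the lists is the easy part; turning them into a single useful display is where the trouble starts.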

Federated challenges
However, there is a fundamental usability and search logic problem with today's generic search federation. Let's presume you are federating just 20 content sources into a single search query interface. Using a traditional search results display format, 10 results per page, we arrive at problem #1: results ordering and display.
  1. What happens if all 20 sources return an average of 60 results for a given query? How are the results combined and displayed intelligently? With ten results per page, the first page can represent at best 10 of the 20 sources, or 50% of the breadth of the corpus. From a usability standpoint, federation demands a results display that best accommodates breadth and depth simultaneously.
  2. Each result set from each source uses its own unique 'relevance' ranking algorithm. Once you have the ordered result set from each source, how do you compare and order the combined results across different data sources?
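
To put rough numbers on both problems, here is a small Python sketch. The source, result, and page counts come from the example above; the scores for the two sources are invented purely to show what happens when rankings on unrelated scales get sorted together.

```python
import math

SOURCES = 20
RESULTS_PER_SOURCE = 60
PAGE_SIZE = 10

# Problem #1: breadth versus depth.
total_results = SOURCES * RESULTS_PER_SOURCE          # 1200 results
total_pages = math.ceil(total_results / PAGE_SIZE)    # 120 pages
# Even at one result per source, page one reaches only 10 of the 20 sources.
max_first_page_coverage = min(PAGE_SIZE, SOURCES) / SOURCES  # 0.5

# Problem #2: each engine scores on its own scale (made-up numbers).
source_a = [("A1", 0.92), ("A2", 0.85)]   # scores bounded between 0 and 1
source_b = [("B1", 14.0), ("B2", 9.5)]    # unbounded scores
naive = sorted(source_a + source_b, key=lambda r: r[1], reverse=True)
naive_order = [title for title, _ in naive]
```

Sorting on the raw scores puts everything from source B ahead of everything from source A, not because B is more relevant but because its numbers happen to be bigger.
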
Arbitrary solutions (aka common hacks)
A. Should we apply a weighting algorithm to each of the sources to favor more 'important' sources? Sure, we could. But this is arbitrary, not contextual, and thus totally inefficient.

B. Should we apply speed? First results to come back get displayed first? Hardly contextual, hardly consistent, and hardly sufficient. A poor man's federation to be sure. More on performance issues later.

C. How about ordering all the results into topic clusters? Sounds great: this allows us to organize all the results from all of our 20 sources into a cluster map, organized by topics, not content sources. On the surface this could indeed address some of federation's shortcomings. However, the problem is that topic clustering technology is woefully inadequate for serious research or just serious federation. I've reviewed, licensed, and tested every serious clustering engine in development, and even hacked together my own clustering algorithms over the past several years. They all have a common problem: they require optimization and customization for each and every content source, and never work consistently enough to achieve mass user adoption. They require unique stop word lists, phrase delineations, dictionaries, cluster tuning, label tuning, and a host of other tweaks. I could go deep here, but let's not get off topic. In fact, wait for my next posting that illustrates why document clustering is also dead, dead, dead.

I mentioned speed above. The other big usability problem is the speed at which each source returns results. Oftentimes we cannot produce a combined results set because the federation engine is waiting on sources to return with their results. Some sources can be woefully slow, causing total response times to take up to 3-5 minutes! Yes, I've seen this in production at large enterprise sites. This is how to cream mass user adoption in about exactly... 3-5 minutes.

D. Another common 'solution' is to let the user pre-select the content sources from which to federate the keyword search. Sounds reasonable on the surface, until you have 20 or 700 data sources to choose from. Even grouping them together leaves too much to the imagination from a usability standpoint. Users aren't trained to 'think' about these intricacies, they just search and go. Advanced Search panes are rarely utilized correctly, if at all. Further, most users will know much less about each content source than the federation platform does. As such, having source selection choices is a massive burden on the user if there are more than 7-12 sources to choose from. In the end, this does not solve the problem; in most cases it adds to it.
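
Hacks A and B, along with the slow-source problem, are easy to sketch in Python, which is part of why they are so common. Everything here is an assumption made for illustration: the source names, the hand-picked weights, the one-second sleep standing in for a source that takes minutes, and the function names themselves.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

# Hack A: hypothetical, hand-picked source weights (this is the arbitrary part).
SOURCE_WEIGHTS = {"premium": 1.0, "intranet": 0.5}

def weighted_merge(results_by_source, weights=SOURCE_WEIGHTS):
    """Merge per-source results, scaling each score by its source's weight."""
    merged = []
    for source, results in results_by_source.items():
        for title, score in results:
            merged.append((title, source, score * weights.get(source, 1.0)))
    return sorted(merged, key=lambda r: r[2], reverse=True)

merged = weighted_merge({
    "premium": [("Court filing", 0.6)],
    "intranet": [("Exact-match policy memo", 0.9)],
})

# Hack B and the speed problem: show whatever has arrived by a deadline.
def fast_source(query):
    return [(f"Fast hit for {query}", 0.7)]

def slow_source(query):
    time.sleep(1.0)  # stands in for a source that takes minutes to answer
    return [(f"Slow hit for {query}", 0.99)]

def federate_with_deadline(query, sources, deadline_seconds):
    """Return results only from sources that answer before the deadline."""
    pool = ThreadPoolExecutor(max_workers=len(sources))
    futures = {pool.submit(fn, query): name for name, fn in sources.items()}
    done, not_done = wait(futures, timeout=deadline_seconds)
    pool.shutdown(wait=False)  # do not block on the stragglers
    answered = {futures[f]: f.result() for f in done}
    skipped = sorted(futures[f] for f in not_done)
    return answered, skipped

answered, skipped = federate_with_deadline(
    "solar power", {"fast": fast_source, "slow": slow_source}, deadline_seconds=0.2
)
```

In the weighted merge, the stronger match from the intranet loses to a weaker one from the favored source, for every query, regardless of context. In the deadline version, the best-scoring hit never reaches the user at all because its source was slow. Neither outcome has anything to do with relevance.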

The real solution - does one really exist?
Wouldn't it be nice if there were a simple, elegant solution to this most vexing problem? Librarians, universities, researchers, and knowledge enterprises would rejoice with a resounding thunder! And the company with the solution would similarly rejoice in the prying open of even the tightest purse strings of customers vying to get their hands on the proven solution once and for all.

Well, there is good news and bad news. The good news is that there is an obvious solution. The bad news is that it is really, really, really hard to do. The solution: index everything. (Note: this is not the same as metasearch, which only aggregates results from separate search engines, as metasearch has no indexing capability...for now ;) One index, one result set, for all content online. Yes, I said it. If literally every content source were opened up to be crawled and indexed without prejudice, a single, uniform index could go to work providing users the most useful results from a single search. Sort of like removing DRM from digital music, in a way, I suppose. Let the content be free! The difference being, premium content publishers would not have to open up the body of the content to the end user. Just look at how Yahoo and others handle searching premium content. You can access the metadata (title, author, abstract, summary, etc.) and then you pay to gain access to the full text. [ Paying for content is yet another topic altogether. Yes, it too is dead, dead, dead...] Yahoo's subscription content federation is an example of the "index everything" solution on a much smaller scale, though this implementation is only partially effective, and only for small groups of content sources.
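
As a toy illustration of the "index everything" idea, here is a Python sketch that builds one index over metadata from several sources and ranks every record on the same scale. The records, field names, and the simple term-count scoring are all assumptions made for the example; a real index would be vastly more sophisticated.

```python
from collections import defaultdict

# Hypothetical metadata records from three otherwise separate sources.
# Only title and abstract are indexed; the full text stays with the publisher.
RECORDS = [
    {"id": "news-1", "source": "news",
     "title": "Solar power prices fall", "abstract": "Panel costs drop again."},
    {"id": "db-7", "source": "database",
     "title": "Wind farm permits", "abstract": "Permit data for wind and solar sites."},
    {"id": "lib-3", "source": "library",
     "title": "History of coal", "abstract": "Coal power in the 1900s."},
]

def build_index(records):
    """Map each lowercase term to the set of record ids whose metadata contains it."""
    index = defaultdict(set)
    for record in records:
        text = f"{record['title']} {record['abstract']}"
        for term in text.lower().replace(".", " ").split():
            index[term].add(record["id"])
    return index

def search(index, query):
    """Rank record ids by number of matching query terms, on one scale for all sources."""
    counts = defaultdict(int)
    for term in query.lower().split():
        for record_id in index.get(term, ()):
            counts[record_id] += 1
    return sorted(counts, key=lambda rid: (-counts[rid], rid))

INDEX = build_index(RECORDS)
hits = search(INDEX, "solar power")
```

Because every record is scored by the same function, there is no per-source ranking to reconcile and no slow source to wait on at query time. The hard part, of course, is getting the records into the index in the first place.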

In theory, a web index such as Google.com is a federated index of sorts, pulling together millions of small and large 'content sources' known as websites into a centralized search index. The goal is fundamentally no different from metasearching, but the architecture and context make for a vastly different user experience and level of effectiveness.

The bad news is also obvious. It is seemingly impossible to get all content sources opened up, and indexed anytime soon. Not to mention the privacy, copyright, formatting, and global policy issues that surround the notion. Just look at all the flak Google gets for scanning books in a library. Given this, might there be another way? Another approach that achieves maximum usability, and extracts maximum value from any cross section of content sources for the user, the researcher, the knowledge worker? I believe there is such a solution in development today. For hints as to the direction of such an approach, let me point you to a few successful 'mini-federators' in the web2.0 world that are really effective.
  • Take a look at: Original Signal (look beyond their new blog style home page to the 'channels' of aggregated content) - a simple example to be sure, and nothing breakthrough per se with the user experience. Rather effective just the same.
  • Take a look at the approach taken by Yahoo, and improved by Google, with the 'personalized' home pages that allow you to customize your content aggregation into an RSS + Ajax dashboard of sorts. There are scores of web2.0 RSS aggregators and some really clever dashboards out there that are planting the seeds for something much bigger.
  • But those are what I call the 'lay-ups' or the obvious choices. Less obvious choices that are closer to what the future holds for federation include: Google CSE and Yahoo Pipes -- think social computing meets vertical search while killing metasearch...
The current design of the 'dashboard' as we know it does not scale to support n number of content sources, and certainly not 700 or 1000, but CSE and Pipes are a very different story. They are essentially do-it-yourself federated indexes, as I described earlier. Very high potential, particularly as screen real estate runs at an all-time premium today. As such, if we are to arrive at a true front-end solution versus a back-end (index everything) solution, it has to scale. It must also remain simple, efficient, and require an almost zero learning curve. New visual metaphors have only exacerbated the adoption and usability problems that plague most federation solutions in the market today.

If search federation sounds like a rather elusive problem to solve, I can promise you, elusive is an understatement. The answer lies in how we interact with, process, and digest information instinctively, not 'intuitively' as most info designers would have you believe. Intuition-driven approaches only lead to new products and solutions that chase their own tail, never really solving the problem at hand. We have seen, and will see yet more, companies come and go with their valiant attempts to crack the code for federated search. But until the real problems with federation are truly understood, be prepared for more tail wagging. Ironically enough, however, it appears to me a solution might soon be launched...right under our noses...hmmm. As always, stay tuned!
