Showing posts with label web search. Show all posts

Monday, August 20, 2007

Being Verity in a Google World

"Real Source Content / Result Federation is alive and well"
- J.W. Lehman, founder of Verity


I would be remiss in not taking the opportunity to respond to an interesting comment posted to my blog back in June on Federated Search. The response came from a founder of Verity, a leading enterprise search vendor acquired by Autonomy. The comments revitalize the debate surrounding the evolution of information retrieval and the evolution of information storage. To best clarify my position, and provide a rebuttal to the points raised by J.W., I'll provide my comments in-line and in bold following the poster's comments. The debate is on, right after the jump!

From J.W. Lehman,
founder of Verity

Real Source Content / Result Federation is alive and well

“Old Federated searchers never die, they just become…..”
anon

1. The poster hasn’t a clue about the purpose of federated search in information retrieval / research. Should “federated search” take the blame for slow/poor collection access? Of course not. Federated search is NOT, as the poster claims, an “interactive” single collection search mechanism, ala google, verity or any like…it’s a “watcher-monitor” of what is going on in the info-world in specific subject areas. If the poster told his enterprise customers they were getting google-for-the-deep-web, the poster just didn’t understand their requirements….typical for IR technology vendors, and VCs. Who cares if the answer takes 5 minutes or 5 hours? The purpose of federated search is sending-alerting new relevant material as it’s generated. Federated search is a very powerful, and quick, research assistant WHEN IT IS APPLIED PROPERLY.


Well, I think the opening paragraph pretty much says it all: "Who cares if the answer takes 5 minutes or 5 hours?" Who indeed...hmmm. How about everyone? Everyone who's grown up with today's superior web search engines at their disposal. I would love to take a poll to see how many people would be willing to wait 5 minutes, much less 5 hours, for ANY form of search. But let us carry on.

We must first correct the poster's bold premise because federated search clearly does not belong in the 'watcher/monitor' category. Watcher/monitors have a very distinct membership that is quite different from federated search. RSS/Atom feed readers, dashboards, and RSS aggregators such as iGoogle, NetVibes, NewsGator, Bloglines, and OriginalSignal, are watcher/monitors. They are not federated search players whatsoever. Search is pull not push. It is active not passive. Feed aggregation is passive, and it pushes. Apples and oranges here.


Federated search supports COMMUNITIES OF INTEREST by replacing the incredibly complex need to individually access and merge content from all appropriate sources in the search for answers (regardless of their “fun-ness” to access), with a process that does it on command.

J.W. obviously hasn't read my other posting on Yahoo Pipes, and Google CSE. I offer up this as prerequisite reading material before claiming that big search engines can't address 'communities of interest' in a far easier and more powerful way.

If the user can’t wait 5 minutes or 5 whatevers for results that he/she couldn’t obtain in 5 weeks-months of manual effort, then the sources themselves must be unnecessary.

This remark invokes the proverbial 'wake up and smell the coffee' response. Every search engine in existence has invested millions in R&D and usability studies, and they unanimously confirm that speed matters. And it doesn't just matter, it is vital to achieving widespread adoption and utility-- it is vital to survival. It is often the difference between #1 in the industry and #100.

See this link for empirical evidence on this point. You'll find that user adoption and user satisfaction are of paramount importance to the search experience. A 500 ms increase in response time results in millions of abandoned searches and unsatisfied users. Can you imagine what would happen if these users had to wait 100 times as long? That would be 50 seconds. How about, as J.W. suggests, 1000 times as long? That would be 500 seconds, more than 8 minutes. I think we can all predict the outcome.


The poster, and most of the rest of us, have fallen under the google-spell that time to first result and time-to-answer are the same. Not! How long does it take to find the fact/assumption/relationship in google/convera/verity/zylab/inxight result # 870? We’ll Never Find It, because we gave up after result 25.

This is no spell. This is reality. The world has evolved, people. The majority of web surfers are a few of us Gen-Xers, plus Gen-Y, Gen-Z, and Millennials. Most were born into this world with a cell phone in hand, and broadband and Wi-Fi everywhere. The expectation of always-on, instant gratification and real-time computing convenience is not a nice-to-have in today's world; it is an assumed, necessary requirement. And they are the best and brightest generations of our time.


2. “keyword” search? What century is the poster from? If you can’t explore content via explicit taxonomies with the searchrules to back them up, of course you’re going to get poor, mixed up results. [and not only is clustering is dead, dead, dead…, it was never alive!]

We do agree on one point above-- clustering is not ready for prime time. Beyond that, perhaps our differences are simply generational. I am part of the Internet generation, and not a day earlier. Let's be real, folks: keyword search works, and it works really, really well. It is indisputably the fastest, most popular, and most effective universal mechanism for finding information today.

Today's keyword search engines are anything but just keywords. But my discussion is not (and has never been) about keyword searching. It is about federated search, its shortcomings, and why we must index everything. But for the sake of discussion, here's my quick take on the state of keyword search technology: Today's 'keyword search interpretation' technologies are more intelligent, proactive, interpretative, interpolative, and extrapolative than ever before. They are capable of much more than meets the eye. But that is the point: to keep it simple for the user, to appear as if the system is 'idiot proof' and that all it takes are a few simple keywords and magic happens. This is increasingly becoming the case today. There is more to do, this is certain. However, keyword search is still by far the most effective input mechanism for matching information with your intent, even if you aren't fully aware of your intent nor fully knowledgeable on the subject you pursue. See an upcoming post titled: "Browsing the Web for Knowledge Using Keyword Search."

The industry deadpool is full of vendors that once hawked taxonomies, directories, and other structured content browsers. Taxonomies are great for very specialized collections of content, but they totally implode when mashed together by a federated search engine and 10 other content sources with totally different ontologies, categories, and metadata. It just doesn't work when blended together from completely different sources.


Index everything!!!!!!!!! Why bother? Keyword search will give you the same mess on an indexed collection…actually worse, because it’s only the rare and to-date, unpopular engine that recognized the presence of evidence at the meaningful text unit (i.e. paragraph) level….so instead of federated search telling you your “KEY-WORD” is actually in the title/snippet/abstract, you now get to discover the 1000x list of content where it’s anywhere in the full-text. What an advancement!

Why bother, hmmm...why indeed... Well, let's see... the last time someone got the idea to do this the right way, out popped a couple of life changing web companies with worldwide adoption and sustained valuations in the tens and hundreds of billions of dollars.


But here's a better reason: It just plain works.

The real problem here is that my counterpart is mixing metaphors for comparison's sake by effectively equating federated search with concept search, and earlier with watcher/monitors. Both are false equations. I'm not comparing methods of retrieval. I am focused on the virtues of storing all content in a single index. And just because we've indexed everything into a single source does not mean that we are limited to mere keyword searching for information retrieval.

Every federated search engine, including Verity, when plugged into multiple sources for keyword searching does at least this much: pass the keyword queries to each content source wired to the federated search, and get results back from each, the keyword way. We know there are many other ways to retrieve content from a source, but this topic is and has always been about federated searching, not federated browsing, nor conceptual matching. All of which can still be done better with a single index of content anyway.


3. Result Federation…..The ability to de-dupe, de-mystify and normalize results from multiple relevancy determination techniques has been available for years…where have you been? All that’s necessary is to make a practical relevance determination of each result based upon the search request; and order it.

Regarding the existence of de-duping, etc. I distinctly don't recall saying anything to the contrary. I merely support the fact that all implementations to date do not work very well. Not one federated search engine can possibly make a reliable relevance determination based on the search query for one simple reason: it is not up to the federated engine to decide! The results that come in from each disparate content source are determined by the ranking and relevancy engine of each source's proprietary algorithm. Thus, even if the federated engine could magically infer the inter-source ranking with some degree of usefulness (though doubtful), the net results would only be as good as the worst ranking algo from the worst content source. Let's look at a simple illustration to clarify, shall we:

Step 1: Example query: nanotechnology fabrication

Step 2: Sources 1-5 are selected to 'federate' - assume sources 3-5 have terrible ranking engines

Step 3: The above keywords (yes keywords J.W.) are passed to each source's query engine

Step 4: The "top ten" results are returned from each source's relevancy engine

Step 5: The 50 results are somehow re-ranked based on the nature of the query? I'd like to see that. Especially since the results returned are merely title, snippet, and URL, and NOT full-text, as is the case with every standard enterprise and web search engine index.

Step 6: Regardless, the poorly ranked documents from sources 3-5 make it impossible to unify the ranking in anything but a largely arbitrary way, giving arbitrary credibility to the results list.

Step 7: Because the federation technology has no way to evaluate how well a given source is ranking its own documents, it is impossible to establish a consistently high quality set of ordered results, using this antiquated yet widely suggested way of federating.
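The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source names, titles, and scores are invented, and the two merge functions are generic strategies, not the logic of Verity or any other product.

```python
from typing import Dict, List, Tuple

# Hypothetical results, already ranked by each source's own engine.
# The score scales differ per source and are not comparable.
SOURCE_RESULTS: Dict[str, List[Tuple[str, float]]] = {
    "source1": [("Nanofabrication review", 0.92), ("Lithography basics", 0.85)],
    "source3": [("Press release", 870.0), ("Unrelated memo", 640.0)],
}


def naive_merge(results: Dict[str, List[Tuple[str, float]]]):
    """Sort everything by raw score, ignoring that the scales differ."""
    merged = [
        (score, source, title)
        for source, hits in results.items()
        for title, score in hits
    ]
    return sorted(merged, reverse=True)


def round_robin_merge(results: Dict[str, List[Tuple[str, float]]]):
    """Interleave by rank position: fair to every source, blind to quality."""
    merged = []
    depth = max(len(hits) for hits in results.values())
    for rank in range(depth):
        for source, hits in results.items():
            if rank < len(hits):
                merged.append((rank + 1, source, hits[rank][0]))
    return merged


if __name__ == "__main__":
    for row in naive_merge(SOURCE_RESULTS):
        print(row)
    for row in round_robin_merge(SOURCE_RESULTS):
        print(row)
```

In the naive merge the weak source wins simply because its engine reports bigger numbers. In the round-robin merge every source gets equal standing no matter how badly it ranks. Neither approach knows anything about the query, which is the point of Steps 5 through 7.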


4. In any subject, google-yahoo-ms-altavista-etc, lets you find out what everyone
else already knows…..the ability to find out what nobody else knows/surmises is
virtually denied.

This belief makes one heck of a gross assumption as to the way in which any of the aforementioned engines employ page ranking. Discovery is purely a function of the nature of the access methods to the information source, all other things being equal. With a single index of content I can create discovery, knowledge, connectedness, and relatedness of concepts, sentences, subjects, and more without the need for federating a single thing. It was called Grokker 2.3 Desktop for Google, back in 2004. Today it's called Google CSE for a single source, and for multiple sources it's called Yahoo Pipes.


That is what federated search is for … multi-disciplined
communities of interest seeking answers to advance knowledge, as opposed to
wikipedias-google results.

June 11, 2007 3:35 PM


Federated search as it exists today is not a social medium, and it was never intended to be. Collaborative filtering and collective intelligence, on the other hand, are the future today. Has someone slept through the web2.0 phenom? digg, delicious, feedburner, flickr, Wize, Yelp, Google Reader, iGoogle. Web 2.0 companies have already categorically taken this aging notion of 'communities of interest' via metasearch tools and turned it upside down-- and actually made it work for the first time. And while all of these new web services aggregate content from a huge number of sources, they are not federated search engines in any sense of the word, as I have described in all of my postings.

What's more, equating or limiting the definition of federated search to apply only to research/enterprise content versus searching public WWW content, is a significant misnomer.

For if the best of today's web search engines were to index ALL of the available high quality, structured enterprise/research content behind the firewall (which now a few of them are doing, btw), I could then profess the end of old-school federated search, that has plagued enterprises, universities, and the world at large for over a decade now. Giving way to entirely new ways of federating, classifying, categorizing content-- but from a universal index of content with standardized metadata and shared ranking algorithms.

So my position remains unchanged, if not reinforced. The doctor has checked the patient for a pulse, and she's still dead as a doornail. Good night and good bye my dear federator...


Friday, July 6, 2007

Powerset's Q & A vs. Keyword Search

Powerset would likely be the first to promote Natural Language Processing, NLP, as the future of search. Their recent blog post provokes a few interesting debates about the premise of their approach to improving Web search as we know it today. In theory, natural language processing is a very attractive method of human-computer interaction. In practice, it still has its limitations.

English is particularly challenging in this regard because it has little inflectional morphology to distinguish between parts of speech. Wikipedia has a simple little example to illustrate this point:

English and several other languages don't specify which word an adjective applies to. For example, consider the string "pretty little girls' school".

  • Does the school look little?
  • Do the girls look little?
  • Do the girls look pretty?
  • Does the school look pretty?

This language code is very tricky to decipher with the high degree of accuracy and consistency necessary to provide an acceptable user experience. The full story after the jump...

Powerset will attempt to solve this problem with NLP and the creation of what must be an insanely massive library of ontologies in attempts to contextualize all the Web. A bold undertaking indeed. But let's set aside their pending solution and look at the potential impact to the user experience an NLP-based system would introduce. NLP works best with well formed questions, phrases, and 'contextual' descriptions. You'd be hard pressed to find NLP making improvements to results returned for some types of typical queries such as: "weather 94107" or "paris hilton" or "the police concert tour dates"

So the question becomes this: What percentage of all Web searches would truly benefit from NLP style queries? Is it enough to make it universal or stand on its own? Or is it better served as an enhancement or feature add-on to existing web search offerings? Me thinks it is the latter. Feature, product, business. Remember the FPB test. All technologies and ideas fall into one of the three.

NLP prefers the user to formulate semi-structured sentences to produce the best or most noticeably improved results when compared to traditional keyword searches. As stated above, this can be very handy for certain types of searches, without question. But what happens if your sentence is poorly written? What if your English, French, or Spanish language skills are not up to par? What if you are unfamiliar with the host's ontologies and vocabularies for a new research topic you want to explore? Can NLP produce better results in the absence of accurate or sufficient natural language input? And what of the content being retrieved? What if it too is miscategorized, or poorly structured text?

A common solution is: categorization, classification, and taxonomic organization of content. Another is to predetermine a vocabulary for a given topic of information. Ontologies as they are better or lesser known, for any genre of information, be it politics, sports, or nanotechnology are thereby subject to the vast interpretation of the authors that create them. These authors assign meaning in ways that could be interpreted much differently from how other people, cultures, and languages understand them to be. This could create incongruence between the question and the answer, er...between the query and the results.

Another interesting data point to bear in mind: Web searchers today are actually quite efficient and effective with keyword searching, enhanced further by increasing fluency with boolean and other advanced search operators. As such, keyword searching is often (but not always) hyper-efficient at getting the user precisely what they are looking for. Let us also remember that "keyword search" per se doesn't necessarily equate to "keyword matching" as the sole or even primary means by which related content is returned from a traditional Web search index. Today's top search engine algorithms are far more complex than simple keyword matching, counting, and/or extraction. In fact, some components of page ranking, relevance, and ordering of results pages are language/text independent. Rather, they rely on the organic substructure of the Web and the interconnections between information, which help to paint the picture of related or important subject matter. This helps tremendously in dealing with the Wild, Wild, Web that is fraught with unstructured text, errors in spelling, inconsistent or incomplete grammar and the like found in millions of web pages around the world.

A lot of people in the industry like to assert that "not much has changed with web search over the past several years," which couldn't be further from the truth. The major search engines are enhancing their core search algos multiple times per week, in fact. The problem is that they (non-search experts, journalists, analysts) base their assessments on what they read or don't read about search in the press. Alternatively, they (new search upstarts and old dying breeds in search and enterprise search) are simply in denial, and keep telling themselves that search hasn't changed to help justify a withering existence.

But do not fret (too much anyway) all is not lost. I do believe there are a few definitive paths to success in the web search industry for new companies with the right idea-- but only those that come prepared with their eyes wide open, and a very realistic view of where search truly is today, and where the world of end-users is influencing it from here. Without an accurate view, let's be honest, they're pretty much dead.

As for Powerset, I have to believe they've embraced this exercise, but only time will tell. Let's see how they debut later this year.


Thursday, June 7, 2007

iSearch, uSearch

Just returned from a month in Asia and Australia (inset pix of Shanghai nightlife). Fascinating centers of innovation from Tokyo to Singapore to Sydney. All have some interesting twists to next-gen Web applications and search. But that's for another time, another post. Today it's time we revisit the death of federated search (aka metasearch, single search, etc.) as we know it, and share a glimpse of what the future holds for finally solving the very elusive problem of getting at all of our information as easily as we should be able to. Ok, just a moment...ok, yep just checked, and it's still dead, dead as a doornail.

My friends and colleagues at Stanford University library and info-sciences department have been researching this problem head-on with over 700 databases of searchable research and academic content at their disposal. They are not alone. Countless universities, companies, and web services at large have found themselves at the end of the same dead end road.

Single search doesn't work, nor does traditional metasearch, or any other twists on federated search. Clustering metasearched results from multiple sources into artificial categories or groups only exacerbates the problem, and thoroughly confuses the end user. (sidebar: clustering has a very long way to go before it is anywhere near ready for prime time public consumption. Until then, it has no business being in search) These approaches have, in fact, proven pointless and only further delay any attempts to arrive at an acceptable user experience for effectively accessing a multitude of content sources simultaneously. I would go as far as saying that these so called solutions are robbing these important customers of their youth. Costing not just hundreds of thousands in license fees, but years of setbacks and distractions dealing with totally ineffective solutions. Everybody seems to have an angle that to me is nothing short of amusing. You'll notice through that link several spins on the same broken solution. I reviewed everything listed in those results. RIP.

There is a reason why basic search remains so widely popular, effective, and accepted by the vast majority of info seekers. Because it works. Because it is simple and intuitive. People get it. What people don't get are kludgey attempts to mash a bunch of square pegs into a round hole. If you look at the quality of search results from any of the tens or hundreds of enterprise search vendors, metasearch peddlers, and then say, Google, what you'll find might surprise you. Or maybe it won't. Yes, obviously Google.com works, and Google Search Appliance is no different. GSA stuck to its roots from Google.com for a reason: simple and intuitive user experience and high quality results-- from ONE source. Today GSA can crawl and index virtually any type of info object or database in existence. Why bother promoting new content in separate databases? This only adds to the problem. And with Google OneBox, we go even further, wiring competing content management systems to a better Google-controlled search experience.

So just what am I getting at? No, I'm not pimping Google's 'wares, but I am using them as one of only a few early examples of how to correctly begin to approach this problem. The answer is simple. One source. One index. One search interface. The fact that 700 databases sit in front of the info seeker is the real problem. There is no cohesive data model to support any meaningful metasearch whatsoever. "Normalizing" the boolean structure of the query language for each source's retrieval method was thought to 'standardize' the results that come back from all these random content sources. Not so. For it is not the query that matters, rather it is how the content is indexed. Just because the genre or subject nature of two content databases appears to be 'related' does not imply that the returned results will be the best combination of the two sources. Why? Because they have completely independent relational structures, metadata schemas, and ontologies.

Federated search, as we knew it before it died, did nothing more than mask this problem with a bland search interface wrapped around a broken and discontinuous distributed data model. Despite the cold reality, many of you still employ this type of solution at an increasingly expensive cost to your company and to your users' productivity.

But let's get back to the answer. Google introduced Universal Search, after quietly testing the concept under an alias website: searchmash.com. Yep, they really do. Universal Search is not there yet, but it is a move in the right direction. Yes, even Google faced a minor federation/metasearch problem as they continued to grow laterally into new content categories, e.g. News, Photos, Videos, Blogs, Products, Scholar, etc... As a result, it became increasingly unclear whether Google.com was the right place to start a search with so many alternate entry points that may be more appropriate for certain searches, e.g.: blogsearch.google.com, or news.google.com, and many more.

Universal Search is an early attempt to give the user a little taste of everything: pictures, videos, blogs, news, and web search results in one result page. Check out this basic example here for Steve Jobs. You get what I'm saying. Now, this doesn't exactly scale if you have 20, 30, or 700 types of content, or content sources to display on a page. They simply wouldn't fit. Additionally, Universal Search is more about displaying content of different types or formats versus merely different sources of content. For example, web pages, news articles, pictures, and videos are all very different types of content. I have designed two unique ways to address this problem, following some of the principles of Universal Search. Enter Integrated Search.

The integration of content sources is where we begin. The devil is most certainly in the details for this design and implementation, but here is the gist:

Recipe for Integrated Search

Ingredients

n parts of unique content sources
1 part really nice crawler/indexer (Nutch, GSA, or Lucene)
1 part high quality query interface with boolean translators, NLP, and auto completion and suggestion. (See CiteSeer or ACM for several)

Frappé all ingredients until smooth. Let stand and cool for 10 minutes.
Season to taste with one or both of the following:

1 search index inverter (yes, the secret sauce)
A dash of user intent interpolation at the point of query


This solves 3 problems at once. A single index, so that no sources need be considered at query time, ever. Smart pre-query processing to help guide the search query to match the users' intent. (We'll discuss intent-driven searches, or lack thereof, in an upcoming post.) And a powerful index/ranker to ensure that every content object in the index, from every original source is uniformly considered when ordering and displaying the results that best match the query.
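The single-index idea can be sketched as a toy inverted index in Python. Everything here is an assumption made for illustration: the `SingleIndex` class, the source names, and the raw term-frequency scoring are stand-ins, not how Nutch, Lucene, or GSA actually rank. What matters is that every document, whatever its origin, is scored by the same function.

```python
from collections import Counter, defaultdict


class SingleIndex:
    """Toy inverted index: one index and one ranker for all content sources."""

    def __init__(self):
        self.postings = defaultdict(Counter)  # term -> {doc_id: term frequency}
        self.docs = {}  # doc_id -> (source, text)

    def add(self, doc_id, source, text):
        """Index a document crawled from any source."""
        self.docs[doc_id] = (source, text)
        for term in text.lower().split():
            self.postings[term][doc_id] += 1

    def search(self, query, limit=10):
        """Score every matching document with the same function."""
        scores = Counter()
        for term in query.lower().split():
            for doc_id, tf in self.postings.get(term, {}).items():
                scores[doc_id] += tf
        # Highest score first; ties are broken by doc_id for stable output.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [(doc_id, self.docs[doc_id][0], score) for doc_id, score in ranked[:limit]]


index = SingleIndex()
index.add("a1", "library_catalog", "nanotechnology fabrication methods for nanotechnology labs")
index.add("b1", "patent_db", "fabrication of steel beams")
index.add("c1", "news_archive", "nanotechnology funding news")

if __name__ == "__main__":
    for hit in index.search("nanotechnology fabrication"):
        print(hit)
```

The source of each document is kept only as metadata for display. It plays no part in the ordering, so no source needs to be consulted at query time.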

This is NOT the case with traditional federators, which do nothing more than combine search results from hundreds of different indexing methodologies, with absolutely no way to 'honestly' or intelligently rank and order results that come from different indexers and ranking algos.

So even without revealing the secret sauce, you can see how this approach is fast, simple, and aligned with traditional search user experiences. The hard part? Crawling all the content sources means writing system adapters to connect to the weirdest of old school flat file DBs, obscure object databases, and a whole lot worse. But if you pick a good crawler or general search product, much of that hacking has been done for you, as with Google's Search Appliance and their 220+ adapters that work pretty well out of the Box, pun intended.

So about that secret sauce? Well, with a good inference about the user's intent we can bias the search results to better cater to the user's objective. And as for index inverting, it's really about inverting the results that come from the index for a given query. Ever curious what results actually appear at the end of a big web search with 5,400,000 results? How about dead middle of those 5.4 mil? Curious, aren't we? Yes, it's all about discovery, and those deeper results can be more useful than you might think.

As screen real estate continues to increase on the desktop/laptop, we'll no doubt continue to see search results get 'fatter', as in wider across the page. Yes, two and three column search results are on the way. And wait till you see where the ads turn up. For search it's just the beginning. For federated search, well, maybe we'll call it a new beginning. But for them, this means starting over. Completely.
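One way to picture "looking at the deep end" of a result list is a small Python helper that returns a window from the top, the middle, or the bottom of an already ranked list. This is only a guess at the spirit of the idea and not the actual secret sauce. The function name `result_window` and its parameters are hypothetical.

```python
def result_window(ranked, where="top", size=10):
    """Return a slice of an already ranked result list.

    where: "top" (the usual first page), "middle", or "bottom".
    """
    if where == "top":
        start = 0
    elif where == "middle":
        start = max(0, len(ranked) // 2 - size // 2)
    elif where == "bottom":
        start = max(0, len(ranked) - size)
    else:
        raise ValueError(f"unknown window: {where!r}")
    return ranked[start:start + size]


# Stand-in for a ranked list of 100 result ids, best first.
ranked_ids = list(range(1, 101))

if __name__ == "__main__":
    print(result_window(ranked_ids, "top"))
    print(result_window(ranked_ids, "middle"))
    print(result_window(ranked_ids, "bottom"))
```

Note that this only works when one index holds the complete ranked list. A federator that fetches the top ten from each source never sees the middle or the end at all.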

So far I've yet to see any legitimate newcomers enter the arena to take up this challenge/opportunity head-on. In the meantime, partial solutions are manifesting within Web search while Google, Yahoo, and Ask continue to advance some good ideas in this arena. Yes, even Ask has been doing 'Unified' Search on their home page for a while now, and it's actually a reasonably clean UI...try out this query: iPhone. Be sure to stretch your browser as wide as it will go...not bad.

Integrated Search, iSearch. Coming to a theater near you? We'll soon find out...


Tuesday, April 3, 2007

Personalized Google Mashups - On The Fly

If you haven't used JSON, you're missing out. If you haven't heard of it, you're just out of it, period. JSON is a great data interchange format that Google utilizes to streamline their first mashup wizard for Google Maps. It's a simple alternative to coding (certain) server-side proxies for HTTP requests to get to data in the form of JSON feeds. JSON liberated this extremely cool mashup wizard at Google a few days ago. Zero coding is required to build very useful Google Maps mashups of your own from your own Google Spreadsheet table. Reminds me of XQuery's thin client-side data extraction properties. Not surprising. Hmmm...XQuery for JSON...we could really be on to something. At any rate, for this example, you have to get your data into Google's Spreadsheet first, but that's far simpler than coding a mashup from scratch. This is the power of great front-side middleware, making custom app building truly user friendly. An excellent step forward that will no doubt unleash a new bevy of corporate, personal, and startup mashups. My first mashup to follow...


Federated Search is dead, dead, dead...

I've been asked to write about this for some time, and that time has finally come. O'Reilly touched on the subject recently discussing Google's plans in this arena. But white papers on the subject are not required to explain why traditional approaches to this dilemma are toast. In this post I'll explain why federation is broken and how corporations, universities, and start-ups continue to throw $ at the wrong end of the problem...click below to dive in!

Federation defined
Federated search is the art of attempting to execute a single keyword search across n number of databases, content sources, indexes, news feeds, etc. This is also known as metasearch, deep-web search, and content aggregation. No central federated index is maintained and no crawling or spidering is required. The idea, in theory anyway, is certainly a convenient one. At Groxis, 90% of our customers were most interested in federating their enterprise content sources. In large companies and universities alike the sea of available content silos for any given organization is vast. It is not uncommon to find hundreds, even thousands of content sources used across a single organization.

Federation is all about passing the user's search query separately to each of those search engines, and collecting x number of search results from each source, and then figuring out how to display them to the user in some meaningful, useful, actionable way. This display challenge occurs because typically many of the content sources are not crawlable or spiderable due to licensing issues, ownership of the content, or because the content resides in a data store that is not crawlable. Examples: SQL databases, proprietary content management systems, and commercial content such as Lexis, Factiva, Reuters, etc. Federated search is a quick and dirty way to scan across a vast array of content sources.
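The fan-out described above can be sketched in Python with the standard library's `concurrent.futures`. The sources here are in-memory stand-ins with invented names and documents. A real federator would call each source's own search API, which this sketch makes no attempt to model.

```python
from concurrent.futures import ThreadPoolExecutor


def make_source(name, documents):
    """Build a stand-in search function for one content source."""
    def search(query, limit):
        terms = query.lower().split()
        hits = [doc for doc in documents if all(t in doc.lower() for t in terms)]
        return [(name, doc) for doc in hits[:limit]]
    return search


# Hypothetical content silos, each with its own (trivial) search engine.
SOURCES = {
    "hr_wiki": make_source("hr_wiki", ["Enterprise search policy", "Vacation policy"]),
    "news_feed": make_source("news_feed", ["Federated search is dead", "Search engine news"]),
}


def federated_search(query, per_source=10):
    """Send the same query to every source and collect up to per_source hits from each."""
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        futures = {
            name: pool.submit(search, query, per_source)
            for name, search in SOURCES.items()
        }
        # The slowest source sets the overall response time.
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    print(federated_search("search"))
```

The function returns one result list per source. How to combine those lists into a single meaningful page is the part left unsolved, and it is exactly the display challenge described above.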

Federated challenges
However, there is a fundamental usability and search logic problem with today's generic search federation. Let's presume you are federating just 20 content sources into a single search query interface. Using a traditional search results display format, 10 results per page, we arrive at problem #1: results ordering and display.

  1. What happens if all 20 sources return an average of 60 results for a given query? How are the results combined and displayed intelligently? With 20 sources and only ten slots, the first ten results can represent, at best, 50% of the breadth of the corpus. From a usability standpoint, federation demands a results display that best accommodates breadth and depth simultaneously.
  2. Each result set from each source uses its own unique 'relevance' ranking algorithm. Once you have the ordered result set from each source, how do you compare and order the combined results across different data sources?
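Both problems can be checked with a few lines of Python. The source counts come from the example above; the scores in the second half are made up purely for illustration.

```python
# Problem 1: breadth. Numbers taken from the 20-source example above.
sources = 20
avg_results_per_source = 60
page_size = 10

total_results = sources * avg_results_per_source
# Best case: every slot on page one comes from a different source.
max_sources_on_first_page = min(page_size, sources)
breadth = max_sources_on_first_page / sources

print(total_results)        # 1200 results competing for 10 slots
print(f"{breadth:.0%}")     # 50%

# Problem 2: relevance scores from different engines are not comparable.
# Hypothetical raw scores: source A ranks on a 0-1 scale, source B on 0-100.
source_a = [("a1", 0.98), ("a2", 0.91)]
source_b = [("b1", 47.0), ("b2", 12.5)]

naive = sorted(source_a + source_b, key=lambda hit: hit[1], reverse=True)
print([doc for doc, _ in naive])   # ['b1', 'b2', 'a1', 'a2']
```

Sorting on raw scores simply puts everything from the source with the bigger scale on top, even though a1 is nearly a perfect match by its own engine's standard.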
Arbitrary solutions (aka common hacks)
A. Should we apply a weighting algorithm to each of the sources to favor more 'important' sources? Sure, we could. But this is arbitrary, not contextual, and thus totally inefficient.

B. Should we apply speed? First results to come back get displayed first? Hardly contextual, and neither consistent nor sufficient. A poor man's federation to be sure. More on performance issues later.

C. How about ordering all the results into topic clusters? Sounds great, this allows us to organize all the results from all of our 20 sources into a cluster map, organized by topics, not content sources. On the surface this could indeed address some of federation's shortcomings. However, the problem is that topic clustering technology is woefully inadequate for serious research or just serious federation. I've reviewed, licensed, and tested every serious clustering engine in development, and even hacked together my own clustering algorithms over the past several years. They all have a common problem: they require optimization and customization to each and every content source, and never work consistently enough to win mass user adoption. They require unique stop word lists, phrase delineations, dictionaries, cluster tuning, label tuning, and a host of other tweaks. I could go deep here, but let's not get off topic. In fact, wait for my next posting that illustrates why document clustering is also dead, dead, dead.

I mentioned speed above. The other big usability problem is the speed at which each source returns results. Oftentimes we cannot produce a combined results set because the federation engine is waiting on sources to return their results. Some sources can be woefully slow, pushing total response times to 3-5 minutes! Yes, I've seen this in production at large enterprise sites. This is how to cream mass user adoption in about exactly... 3-5 minutes.

D. Another common 'solution' is to let the user pre-select the content sources from which to federate the keyword search. Sounds reasonable on the surface, until you have 20 or 700 data sources to choose from. Even grouping them together leaves too much to the imagination from a usability standpoint. Users aren't trained to 'think' about these intricacies; they just search and go. Advanced Search panes are rarely utilized correctly, if at all. Further, most users will know much less about each content source than the federation platform does. As such, having source selection choices is a massive burden on the user if there are more than 7-12 sources to choose from. In the end, this does not solve the problem; in most cases it adds to it.
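For the curious, hacks A and B, plus a hard deadline for slow sources, can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the three sources, the hand-picked `WEIGHTS`, the min-max normalization, and the 0.3 second deadline. It is not a description of any shipping federation engine.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

# Hypothetical sources returning (title, raw_score) pairs on different scales.
def fast_source(query):
    return [("fast hit 1", 0.9), ("fast hit 2", 0.3)]

def big_scale_source(query):
    return [("big hit 1", 80.0), ("big hit 2", 20.0)]

def slow_source(query):
    time.sleep(1.0)  # simulates a source that misses the deadline
    return [("slow hit 1", 5.0)]

SOURCES = {"fast": fast_source, "big": big_scale_source, "slow": slow_source}

# Hack A: hand-assigned source weights. The numbers are arbitrary,
# which is exactly the objection raised above.
WEIGHTS = {"fast": 1.0, "big": 0.5, "slow": 0.8}

def normalize(hits):
    """Min-max scale one source's scores into the range 0..1."""
    scores = [score for _, score in hits]
    low, high = min(scores), max(scores)
    span = (high - low) or 1.0
    return [(title, (score - low) / span) for title, score in hits]

def federate(query, deadline_seconds=0.3):
    pool = ThreadPoolExecutor(max_workers=len(SOURCES))
    futures = {pool.submit(fn, query): name for name, fn in SOURCES.items()}
    # Stop waiting at the deadline instead of blocking on the slowest source.
    done, not_done = wait(futures, timeout=deadline_seconds)
    pool.shutdown(wait=False)

    merged = []
    for future in done:
        name = futures[future]
        for title, score in normalize(future.result()):
            merged.append((title, round(score * WEIGHTS[name], 3), name))
    merged.sort(key=lambda hit: hit[1], reverse=True)
    skipped = sorted(futures[f] for f in not_done)
    return merged, skipped

merged, skipped = federate("federated search")
print(merged[:2])
print("skipped:", skipped)
```

The merged list is fast to produce and looks tidy, but its order is decided by the weights table and by which sources happened to answer in time, not by anything in the query. The slow source is silently dropped, and the user never learns what was in it.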

The real solution - does one really exist?
Wouldn't it be nice if there were a simple, elegant solution to this most vexing problem? Librarians, universities, researchers, and knowledge enterprises would rejoice with a resounding thunder! And the company with the solution would similarly rejoice in the prying open of even the tightest purse strings of customers vying to get their hands on the proven solution once and for all.

Well, there is good news and bad news. The good news is that there is an obvious solution. The bad news is that it is really, really, really hard to do. The solution: index everything. (Note: this is not the same as metasearch, which only aggregates results from separate search engines, as metasearch has no indexing capability...for now ;) One index, one result set for all content online. Yes, I said it. If literally every content source were opened up to be crawled and indexed without prejudice, a single, uniform index could go to work providing users the most useful results from a single search. Sort of like removing DRM from digital music, in a way, I suppose. Let the content be free! The difference being, premium content publishers would not have to open up the body of the content to the end user. Just look at how Yahoo and others handle searching premium content. You can access the metadata (title, author, abstract, summary, etc.) and then you pay to gain access to the full text. [ Paying for content is yet another topic altogether. Yes, it too is dead, dead, dead...] Yahoo's subscription content federation is an example of the "index everything" solution on a much smaller scale, though this implementation is only partially effective, and only for small groups of content sources.
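As a toy illustration of the "index everything" idea, the Python sketch below builds one inverted index over documents from several sources and applies one ranking rule to all of them. The documents, field names, and the term-count ranking are all hypothetical simplifications. Premium items are fully indexed, but only their metadata is shown to the searcher.

```python
from collections import defaultdict

# Hypothetical documents pulled from three different content sources.
DOCUMENTS = [
    {"id": "wiki-1", "source": "intranet wiki", "premium": False,
     "title": "Search federation notes",
     "text": "notes on federation and metasearch engines"},
    {"id": "news-7", "source": "premium newswire", "premium": True,
     "title": "Enterprise search market",
     "text": "enterprise search vendors and federation deals"},
    {"id": "db-3", "source": "sql database", "premium": False,
     "title": "Customer records",
     "text": "customer contact records"},
]

def build_index(documents):
    """One inverted index (term -> document ids) across every source."""
    index = defaultdict(set)
    for doc in documents:
        for term in (doc["title"] + " " + doc["text"]).lower().split():
            index[term].add(doc["id"])
    return index

def search(index, documents, query):
    by_id = {doc["id"]: doc for doc in documents}
    counts = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            counts[doc_id] += 1
    # One uniform ranking rule for all sources: matched terms, then id.
    ranked = sorted(counts, key=lambda doc_id: (-counts[doc_id], doc_id))
    results = []
    for doc_id in ranked:
        doc = by_id[doc_id]
        results.append({
            "id": doc_id,
            "title": doc["title"],
            "source": doc["source"],
            # Premium text is searchable, but only metadata is displayed.
            "access": "pay for full text" if doc["premium"] else "open",
        })
    return results

index = build_index(DOCUMENTS)
hits = search(index, DOCUMENTS, "search federation")
for hit in hits:
    print(hit)
```

Because every document is scored by the same rule, there is no cross-source merge step at all, and no waiting on a slow remote engine at query time. The hard part, as noted, is getting permission to build the index in the first place.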

In theory a web index such as Google.com is a federated index of sorts, pulling together millions of small and large 'content sources' known as websites into a centralized search index. Fundamentally the goal is no different from metasearching, but architecturally and contextually it delivers a vastly different user experience and level of effectiveness.

The bad news is also obvious. It is seemingly impossible to get all content sources opened up, and indexed anytime soon. Not to mention the privacy, copyright, formatting, and global policy issues that surround the notion. Just look at all the flak Google gets for scanning books in a library. Given this, might there be another way? Another approach that achieves maximum usability, and extracts maximum value from any cross section of content sources for the user, the researcher, the knowledge worker? I believe there is such a solution in development today. For hints as to the direction of such an approach, let me point you to a few successful 'mini-federators' in the web2.0 world that are really effective.
  • Take a look at: Original Signal (look beyond their new blog style home page to the 'channels' of aggregated content) - a simple example to be sure, and nothing breakthrough per se with the user experience. Rather effective just the same.
  • Take a look at the approach taken by Yahoo, and improved by Google with the 'personalized' home pages that allow you to customize your content aggregation into an RSS + Ajax dashboard of sorts. There are scores of web2.0 RSS aggregators and some really clever dashboards out there, that are planting the seeds for something much bigger.
  • But those are what I call the 'lay-ups' or the obvious choices. Less obvious but closer to what the future holds for federation include: Google CSE and Yahoo Pipes -- think social computing meets vertical search while killing metasearch...
The current design of the 'dashboard' as we know it does not scale to support n number of content sources, and certainly not 700 or 1000, but CSE and Pipes are a very different story. They are essentially do-it-yourself federated indexes as I described earlier. Very high potential, particularly as screen real estate runs at an all-time premium today. As such, if we are to arrive at a true front-end solution versus a back-end (index everything) solution, it has to scale. It must also remain simple, efficient, and require an almost zero learning curve. New visual metaphors have only exacerbated the adoption and usability problems that plague most federation solutions in the market today.

If search federation sounds like a rather elusive problem to solve, I can promise you, elusive is an understatement. The answer lies in how we interact with, process, and digest information instinctively, not 'intuitively' as most info designers would have you believe. Intuition-driven approaches only lead to new products and solutions that chase their own tail, never really solving the problem at hand. We have seen, and will see yet more companies come and go with their valiant attempts to crack the code for federated search. But until the real problems with federation are truly understood, be prepared for more tail wagging. Ironically enough however, it appears to me a solution might soon be launched...right under our noses...hmmm. As always, stay tuned!

Read the full story

Monday, March 5, 2007

Yahoo's Response: The "Real" Declaration

Following my earlier post regarding Yahoo's recent restructuring efforts, Yahoo responds. The story was first illuminated by TechCrunch's posting of an internal email from their CFO. As it turns out, Yahoo wasn't too pleased with their posting-- surprise, surprise. Nonetheless, Yahoo management was quick, and kind enough, to respond to my analysis of the posting here at Sixorg. In fact, I was surprised to hear from quite a few of my friends at Yahoo, all eager to get the story updated. So here's what we now know...

In my posting I stated that Tim Cadogan was heading up core search. Unfortunately that is not the case. Tim is instead heading up Search and Listings Marketplaces, which is essentially search advertising, not core search. Core Search, as it turns out, resides in a separate organization called the Audience Group under the leadership of Jeff Weiner as part of his Network division. "The Network division is now comprised of five areas including Search, Community & Communications, Front Doors, News & Information and Entertainment. We believe this new structure will allow us to better align our strategy with the organization and deliver on its mission to 'connect people to their passions, their communities and the world's knowledge,'" states a Yahoo! spokesperson.

Jeff's been in various M&A, and search roles since Terry Semel brought him over from his Hollywood venture investment firm, Windsor Digital, where Jeff was doing M&A work for Terry's deals. Jeff is a good guy, and well liked by some hard-core senior search gurus inside Yahoo that I know personally. Yet some folks there tell me they find it curious how Jeff was nominated to run a multi-billion dollar core search group, with no search background prior to Yahoo. Remember, in the Valley, it's not always what you know, it's who you know.

Eckart Walther and Andrew Braccia head up Search within the Network division, which is a good thing. Eckart is sharp, and the real deal in core search. I enjoyed engaging with Eckart on core search innovation throughout the partnership between our companies.

Yahoo also confirms that all teams dedicated to Panama, Yahoo's new search advertising platform, are housed under Sue Decker.

The only real problem I see with this emerging structure is NOT the suggested peanut butter being spread around the company. In fact, a company of Yahoo's scale and global reach requires at least this much infrastructure and yes, bureaucracy, to continue to scale and drive growth efficiently. Yahoo faces a similar problem that many new entrants into the search and search advertising business face today. It's the chicken and egg dilemma. You'll notice that Yahoo has organized itself into a pair of Supply and Demand groups, to become better market makers for the advertising business.

But what good is a rich network of advertisers without premium inventory across the web publishers' real estate on which to run it? And conversely, what good is a rich publisher base if the ads and ad serving technology can't stand up very well against the competition? Tim's group inside Yahoo is perhaps the most vital in this equation. However, Yahoo's creation of three peer groups in APG (Supply, Demand, and Products) may very well create more challenges down the road. The "magic in the middle," as Yahoo describes it, grossly underestimates the role that Panama and other core technology must play in Yahoo's latest competitive bid. That technology is far more germane to Yahoo's growth than simply being the glue between APG's supply and demand. Just ask Wall Street:

All of Yahoo's recent stock activity is based not on new divisions, roles, or titles, but solely on the promise of Panama, a content matching technology. Mark Morrissey has been promoted to SVP of APG Product Management, and as such plays a key role in clarifying this company wide. This is one thing that Google has not lost sight of, even amidst their mesmerizing growth.

At a bare minimum, a dotted line to core search is a must (but we all know that would just be cheating). I just can't rationalize keeping core search and core ad serving technologies in two very different parts of the organization, because of their revenue generating power. The technologies are very interdependent, and strategically linked to driving web traffic, click-throughs, and loyalty (trusted search). Remember, keyword searching isn't the only way we search anymore.

Let us not forget, the true 'Audience' as Yahoo puts it, is the end user, the web surfer, the web searcher. And this Audience is a key ingredient to the supply and demand channels for both publishers and advertisers.

Thursday, February 15, 2007

Yahoo's Declaration of Dependence

As Mike puts it today on TechCrunch, a lot of peanut butter bein' spread around Yahoo again with Sue Decker's email sent out company-wide this morning. I had dinner with Sue a couple years ago, and ultimately completed a few strategic deals between Yahoo and Groxis, the company I co-founded and ran for the past five or so years. What I can tell you is this. Sue is a smart, driven executive. No doubt she'll get the operational job done better than anyone at Yahoo today. However, I question a few of the organizational moves she's outlined in this 2007 manifesto. Perhaps the most glaring and most critical to Yahoo is where their core search group gets parked inside the organization. According to the email, core search is now part of "Marketing Products" which I find rather curious.

The three most important rules for any startup company in the Valley are: focus, focus, focus.

The good news is that Tim Cadogan has been tapped and promoted to head up the core search organization. Tim is a very smart, capable, and cool guy, with whom I structured our 'landmark' deal with Yahoo. Landmark in the sense that we were one of the first search companies to get Yahoo and Google to play nice in the same small sandbox that was my startup. Anyway, Tim is unquestionably the right guy for the job, and needs his group to be elevated to have a greater effect on how the core search is being leveraged not as a marketing product, but as a supply-side horizontal revenue generator across the company. The new structure could put the company at risk by layering it vertically into the organization. If you look at Google, core search is far bigger, and far more horizontally aligned across the company. And for good reason. It (along with AdWords) is their bread and butter. It used to be Yahoo's bread and butter too.

I can't help but think that Yahoo's acquisition spree has precipitated this latest move. With so many seemingly random acquisitions to keep pace with competition, the company has had to react vs. act. This leads to a contrived strategy based on the new assets they have to deal with and leverage effectively to please Wall Street (read: justify the acquisitions). Whereas you'll notice Google to be slightly more proactive, with a seemingly tight-lipped, strategically planned growth strategy. We'll be watching closely to see how this new plan plays out for Yahoo. Don't lose sight of the importance of core search, Sue, Yahoo depends on it. And remember, it's never too late to focus, focus, focus.

Monday, February 12, 2007

Tag, I'm It!

Well, I can't think of a better catalyst to get my blog project off the ground than receiving a Blog-Tag from Chris Shipley, of DEMO fame. I'm still waiting on the domain name of choice to come through, but I thought to myself, why wait? For those out of the know, the idea behind Pulver's Blog-Tag is pretty simple. When you get tagged, you have to share five things most people probably didn't know about you, on your blog of course (yes, must have blog to play), then tag five others. From the looks of it, I'm in good company with Chris, and her tagged five. Check 'em out yourself. In honor of the brand, I'm modifying the rules to share six little known facts about me to introduce Sixorg. Read on...

1. As for Sixorg, well, it's been a long time coming. After years of badgering from my friends like Ross Mayfield at Socialtext, Giovanni Rodriguez - PR dynamo-guru-extraordinaire, and Dave Sifry over at Technorati way back in 2004, I've finally pulled it together to get my vocals online. Mostly so these guys and the rest of my crew don't have to listen to me rant and rave about the tech industry in person. Now they can deal with me in doses, or avoid me altogether.

Tag, tag, and tag, you three are it!

2. Watch this space for several interesting things to come. I'm particularly interested in challenging conventional wisdom and conventional commentary on a range of hardcore topics including web search, user experience, product design, startups, and the truths, lies, and videotapes of the venture capital game. When I say hardcore, I mean sans the fluff, hype, and Kool-Aid that inebriates even the highest profile industry websites and blogs in action today.

3. After building seven startups in the Valley, it's time I join the fray to separate the industry truths from the naive, the misinformed, and from those that have neither been there nor done that. Man, if those walls could only talk. There's a juicy best-seller in there just waiting to break free...someday.

I'm also inspired by those that have paved the way into this game, including John Battelle, good friend to Groxis back in the day, and Matt Marshall from the Merc and now VentureBeat.

John, Matt, tag, tag you're both it!

4. I am a closet architect (read: not licensed nor formally trained). I've designed 3 buildings that have actually gone into production as I like to say. One of which I live in today.

5. Huge fan of The Police. Reunited for a 30th anniversary tour in 2007. My Sting stories to follow, later in the year.

6. I had a really inspiring conversation today with the founder of Linden Lab, Philip Rosedale, aka Philip Linden in Second Life. There is definitely way more to this story, and to Philip, than most people realize. More on this later in Sixorg, once I turn up my Second Life account.

A real-time 'tag you're it' Philip!

So there you have it. Six little known facts, six tags, and Sixorg. Chris Shipley, I owe you next round at the pub, it's all your fault.

Welcome to the show, let's get on to the good stuff. Grok 'n roll.

Cheers,

-- R.J.
