Friday, April 27, 2007

The Graphical Information Interface - Made Simple

We pioneered graphical information interfaces, and coined the term 'GII', at Groxis back in 2001. The significance of new interfaces to information transcends that of any one company and any one particular problem to be solved. The sheer mass of information that we now process on any given day is orders of magnitude greater than ever before. As such, the means by which we absorb, leverage, and filter information flows (or floods) must also evolve. Enter the GII. We created Grokker with the idea that information needed to be liberated from the confines of HTML and the web browser as we knew it years ago. It was our "1.0" attempt to create a universal information currency with the capacity to convey far more usefulness than a list of 10 search results on a web page. Grokker sparked a small but potent movement along these lines, paving the way for new approaches to opening the information bottleneck at the point of consumption. Many great advances have emerged, showing promise for the information experiences to come.

Sixorg is all about information, search, and the like. As such, you can expect to find many cool, new examples of the GII that fit the bill in future posts. Today I'm sharing a 'blog find' that ranks high on the list for one of my good friends, and is now high on my list. It's called Indexed, by Jessica Hagy. In a word, brilliant. A witty and creative smattering of hand-drawn Venn diagrams, scatter graphs, and more, representing mathematical theorem proofs of everyday life, infoviz style. The clever and entertaining diagrams and graphs convey information in spades, further proving another theorem: that a picture is worth a thousand words. Indexed is a simple but great example of GIIs in action. There is no reason to explain it further, because the examples so clearly speak for themselves. 'Nuff said.

Take a look, bookmark Indexed. It's fresh air for your creative mind. Thanks to my friend and colleague Rosie for the link. Rosie doesn't have a blog, but should. Girl's got skills. Maybe this will kick her in gear to get her game on! More to come. Stay tuned...

Read the full story

Tuesday, April 3, 2007

Personalized Google Mashups - On The Fly

If you haven't used JSON, you're missing out. If you haven't heard of it, you're just out of it, period. JSON is a great data interchange format that Google utilizes to streamline their first mashup wizard for Google Maps. It's a simple alternative to coding (certain) server-side proxies for HTTP requests to get to data in the form of JSON feeds. JSON liberated this extremely cool mashup wizard at Google a few days ago. Zero coding required to build very useful Google Maps mashups of your own from your own Google Spreadsheet table. Reminds me of XQuery's thin client-side data extraction properties. Not surprising. Hmmm...XQuery for JSON...we could really be on to something. At any rate, for this example, you have to get your data into Google's Spreadsheet first, but that's far simpler than coding a mashup from scratch. This is the power of great front-side middleware, making custom app building truly user friendly. An excellent step forward that will no doubt unleash a new bevy of corporate, personal, and startup mashups. My first mashup to follow...
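To make the idea concrete, here is a minimal sketch of the kind of step such a wizard takes off your hands: turning a JSON feed of spreadsheet rows into map markers. The feed shape below (a `rows` list with `name`, `lat`, and `lng` fields) and the `rows_to_markers` helper are made up for illustration; this is not Google's actual feed format or API.

```python
import json

# Hypothetical JSON feed of spreadsheet rows; the field names are
# assumptions for illustration, not Google's actual feed format.
FEED = """
{"rows": [
  {"name": "Office", "lat": "37.77", "lng": "-122.42"},
  {"name": "Warehouse", "lat": "37.80", "lng": "-122.27"},
  {"name": "No location", "lat": "", "lng": ""}
]}
"""

def rows_to_markers(feed_text):
    """Parse the feed and return one marker dict per row with usable coordinates."""
    markers = []
    for row in json.loads(feed_text)["rows"]:
        try:
            lat, lng = float(row["lat"]), float(row["lng"])
        except (KeyError, ValueError):
            continue  # skip rows without numeric coordinates
        markers.append({"title": row["name"], "lat": lat, "lng": lng})
    return markers

markers = rows_to_markers(FEED)
```

Rows with blank or malformed coordinates are simply skipped here; a real mashup would probably want to report them back to the spreadsheet owner instead.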


Federated Search is dead, dead, dead...

I've been asked to write about this for some time, and that time has finally come. O'Reilly touched on the subject recently discussing Google's plans in this arena. But white papers on the subject are not required to explain why traditional approaches to this dilemma are toast. In this post I'll explain why federation is broken and how corporations, universities, and start-ups continue to throw $ at the wrong end of the problem...click below to dive in!

Federation defined
Federated search is the art of attempting to execute a single keyword search across n number of databases, content sources, indexes, news feeds, etc. This is also known as metasearch, deep-web search, and content aggregation. No central federated index is maintained and no crawling or spidering is required. The idea, in theory anyway, is certainly a convenient one. At Groxis, 90% of our customers were most interested in federating their enterprise content sources. In large companies and universities alike, the sea of available content silos for any given organization is vast. It is not uncommon to find hundreds, even thousands, of content sources used across a single organization.

Federation is all about passing the user's search query separately to each of those search engines, collecting x number of search results from each source, and then figuring out how to display them to the user in some meaningful, useful, actionable way. This display challenge occurs because typically many of the content sources are not crawlable or spiderable due to licensing issues, ownership of the content, or because the content resides in a data store that is not crawlable. Examples: SQL databases, proprietary content management systems, and commercial content such as Lexis, Factiva, Reuters, etc. Federated search is a quick and dirty way to scan across a vast array of content sources.
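The fan-out step described above can be sketched in a few lines. This is only an illustration under stated assumptions: `search_news` and `search_sql` are hypothetical stand-ins for real back ends, and `federate` is a made-up name, not any product's API.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for real search back ends. Each takes a query and
# returns that source's own ranked list of result titles.
def search_news(query):
    return [f"news: {query} #{i}" for i in range(1, 6)]

def search_sql(query):
    return [f"sql: {query} #{i}" for i in range(1, 4)]

SOURCES = {"news": search_news, "sql": search_sql}

def federate(query, per_source=3):
    """Send the same query to every source and keep the top results from each."""
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in SOURCES.items()}
        return {name: f.result()[:per_source] for name, f in futures.items()}

results = federate("grokker")
```

Note what the sketch does not do: it hands back one ranked list per source and leaves the question of how to combine and display them entirely open.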

Federated challenges
However, there is a fundamental usability and search logic problem with today's generic search federation. Let's presume you are federating just 20 content sources into a single search query interface. Using a traditional search results display format, 10 results per page, we arrive at problem #1: results ordering and display.

  1. What happens if all 20 sources return an average of 60 results for a given query? How are the results combined and displayed intelligently? The first page of ten results can represent, at best, 10 of the 20 sources, or 50% of the breadth of the corpus. From a usability standpoint, federation demands a results display that best accommodates breadth and depth simultaneously.
  2. Each result set from each source uses its own unique 'relevance' ranking algorithm. Once you have the ordered result set from each source, how do you compare and order the combined results across different data sources?
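The two problems above can be put in numbers. The sketch below works through the 20-source example and then shows what goes wrong when raw relevance scores from different sources are merged directly; the two score scales (0-1 and 0-100) and the document ids are invented for illustration.

```python
import math

SOURCES = 20          # content sources being federated
RESULTS_EACH = 60     # average results returned per source
PAGE_SIZE = 10        # traditional results page

total_results = SOURCES * RESULTS_EACH                  # 1200
total_pages = math.ceil(total_results / PAGE_SIZE)      # 120
# Even with one result per source, page one can show at most PAGE_SIZE sources.
first_page_breadth = min(PAGE_SIZE, SOURCES) / SOURCES  # 0.5

# Problem #2: raw relevance scores are not comparable across sources.
# Hypothetical scores: source A scores on a 0-1 scale, source B on 0-100.
source_a = [("A1", 0.98), ("A2", 0.91)]
source_b = [("B1", 87.0), ("B2", 42.0)]
naive_merge = sorted(source_a + source_b, key=lambda r: r[1], reverse=True)
top_ids = [doc_id for doc_id, _ in naive_merge]  # B's results crowd out A's
```

Sorting on the raw scores puts every result from source B ahead of source A's best hit, purely because of the scale each engine happens to use.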

Arbitrary solutions (aka common hacks)
A. Should we apply a weighting algorithm to each of the sources to favor more 'important' sources? Sure we could. But this is arbitrary, not contextual, and thus totally inefficient.

B. Should we apply speed? First results to come back get displayed first? Hardly contextual, hardly consistent, and hardly sufficient. A poor man's federation to be sure. More on performance issues later.

C. How about ordering all the results into topic clusters? Sounds great: this allows us to organize all the results from all of our 20 sources into a cluster map, organized by topics, not content sources. On the surface this could indeed address some of federation's shortcomings. However, the problem is that topic clustering technology is woefully inadequate for serious research or just serious federation. I've reviewed, licensed, and tested every serious clustering engine in development, and even hacked together my own clustering algorithms over the past several years. They all have a common problem: they require optimization and customization to each and every content source, and never work consistently enough to win mass user adoption. They require unique stop word lists, phrase delineations, dictionaries, cluster tuning, label tuning, and a host of other tweaks. I could go deep here, but let's not get off topic. In fact, wait for my next posting that illustrates why document clustering is also dead, dead, dead.

I mentioned speed above. The other big usability problem is the speed at which each source returns results. Oftentimes we cannot produce a combined result set because the federation engine is waiting on sources to return their results. Some sources can be woefully slow, causing total response times of up to 3-5 minutes! Yes, I've seen this in production at large enterprise sites. This is how to cream mass user adoption in about exactly... 3-5 minutes.

D. Another common 'solution' is to let the user pre-select the content sources from which to federate the keyword search. Sounds reasonable on the surface, until you have 20 or 700 data sources to choose from. Even grouping them together leaves too much to the imagination from a usability standpoint. Users aren't trained to 'think' about these intricacies, they just search and go. Advanced Search panes are rarely utilized correctly, if at all. Further, most users will know much less about each content source than the federation platform does. As such, having source selection choices is a massive burden on the user if there are more than 7-12 sources to choose from. In the end, this does not solve the problem; in most cases it adds to it.
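Hack A and the speed problem can be sketched together. The code below is an illustration only: the source names, weights, delays, and the 0.2-second deadline are all invented, and dropping slow sources at a deadline is one possible workaround assumed here, not something the hacks above prescribe.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

# Hypothetical sources: (weight, delay in seconds, ranked results).
SOURCES = {
    "intranet": (1.0, 0.0, ["i1", "i2", "i3"]),
    "newswire": (0.6, 0.0, ["n1", "n2", "n3"]),
    "archive":  (0.8, 0.5, ["a1", "a2", "a3"]),   # the slow one
}

def query_source(name):
    _, delay, results = SOURCES[name]
    time.sleep(delay)  # simulate network and back-end latency
    return results

def weighted_merge(result_sets):
    """Hack A: score = source weight * position-based score (1.0 for rank 1)."""
    scored = []
    for name, results in result_sets.items():
        weight = SOURCES[name][0]
        for rank, doc in enumerate(results):
            scored.append((weight * (1.0 - rank / len(results)), doc))
    # Sort by score descending; ties broken by document id for determinism.
    return [doc for score, doc in sorted(scored, key=lambda s: (-s[0], s[1]))]

def federate(deadline=0.2):
    """Wait at most `deadline` seconds, then merge whatever has come back."""
    pool = ThreadPoolExecutor(max_workers=len(SOURCES))
    futures = {pool.submit(query_source, name): name for name in SOURCES}
    done, not_done = wait(list(futures), timeout=deadline)
    pool.shutdown(wait=False)  # do not block on the slow sources
    answered = {futures[f]: f.result() for f in done}
    skipped = sorted(futures[f] for f in not_done)
    return weighted_merge(answered), skipped

merged, skipped = federate()
```

Both knobs are arbitrary. Change a weight and the ordering changes; tighten the deadline and a whole source silently disappears from the results. Neither decision knows anything about what the user actually searched for.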

The real solution - does one really exist?
Wouldn't it be nice if there were a simple, elegant solution to this most vexing problem? Librarians, universities, researchers, and knowledge enterprises would rejoice with a resounding thunder! And the company with the solution would similarly rejoice in the prying open of even the tightest purse strings of customers vying to get their hands on the proven solution once and for all.

Well, there is good news and bad news. The good news: there is an obvious solution. The bad news: it is really, really, really hard to do. The solution: index everything. (Note: this is not the same as metasearch, which only aggregates results from separate search engines, as metasearch has no indexing capability...for now ;) One index, one result set for all content online. Yes, I said it. If literally every content source were opened up to be crawled and indexed without prejudice, a single, uniform index could go to work providing users the most useful results from a single search. Sort of like removing DRM from digital music, in a way, I suppose. Let the content be free! The difference being, premium content publishers would not have to open up the body of the content to the end user. Just look at how Yahoo and others handle searching premium content. You can access the metadata (title, author, abstract, summary, etc.) and then you pay to gain access to the full text. [ Paying for content is yet another topic altogether. Yes, it too is dead, dead, dead...] Yahoo's subscription content federation is an example of the "index everything" solution on a much smaller scale, though this implementation is only partially effective, and only for small groups of content sources.
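As a toy illustration of "one index, one result set", the sketch below builds a single inverted index over documents from several sources and ranks them all with one scoring rule. The documents, source names, and the simple count-the-matching-terms score are assumptions chosen for brevity, not how any production index works.

```python
from collections import defaultdict

# Hypothetical documents from three separate content sources. For premium
# content only the metadata (title, abstract) would be indexed, not the full text.
DOCS = [
    ("wiki-1", "intranet", "federated search overview"),
    ("news-7", "newswire", "search engines and federated indexes"),
    ("db-42",  "sql",      "quarterly sales figures"),
]

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, _source, text in docs:
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Rank every document by how many query terms it matches, regardless of source."""
    hits = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    # Highest match count first; ties broken by document id.
    return sorted(hits, key=lambda d: (-hits[d], d))

INDEX = build_index(DOCS)
ranked = search(INDEX, "federated search")
```

Because every document is scored by the same rule, the cross-source ordering question from the federated case never comes up; the hard part is getting the content into the index in the first place.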

In theory a web index such as Google.com is a federated index of sorts, pulling together millions of small and large 'content sources' known as websites into a centralized search index. The goal is fundamentally no different from metasearch, but the architecture, and with it the user experience and effectiveness, is vastly different.

The bad news is also obvious. It is seemingly impossible to get all content sources opened up, and indexed anytime soon. Not to mention the privacy, copyright, formatting, and global policy issues that surround the notion. Just look at all the flak Google gets for scanning books in a library. Given this, might there be another way? Another approach that achieves maximum usability, and extracts maximum value from any cross section of content sources for the user, the researcher, the knowledge worker? I believe there is such a solution in development today. For hints as to the direction of such an approach, let me point you to a few successful 'mini-federators' in the web2.0 world that are really effective.
  • Take a look at: Original Signal (look beyond their new blog style home page to the 'channels' of aggregated content) - a simple example to be sure, and nothing breakthrough per se with the user experience. Rather effective just the same.
  • Take a look at the approach taken by Yahoo, and improved by Google with the 'personalized' home pages that allow you to customize your content aggregation into an RSS + Ajax dashboard of sorts. There are scores of web2.0 RSS aggregators and some really clever dashboards out there, that are planting the seeds for something much bigger.
  • But those are what I call the 'lay-ups' or the obvious choices. Less obvious, but closer to what the future holds for federation, are Google CSE and Yahoo Pipes -- think social computing meets vertical search while killing metasearch...

The current design of the 'dashboard' as we know it does not scale to support n number of content sources, and certainly not 700 or 1000, but CSE and Pipes are a very different story. Essentially do-it-yourself federated indexes as I described earlier. Very high potential. Particularly as screen real estate runs at an all-time premium today. As such, if we are to arrive at a true front-end solution versus a back-end (index everything) solution, it has to scale. It must also remain simple, efficient, and require an almost zero learning curve. New visual metaphors have only exacerbated the adoption and usability problems that plague most federation solutions in the market today.

If search federation sounds like a rather elusive problem to solve, I can promise you, elusive is an understatement. The answer lies in how we interact with, process, and digest information instinctively, not 'intuitively' as most info designers would have you believe. Intuition-driven approaches only lead to new products and solutions that chase their own tail, never really solving the problem at hand. We have seen, and will see yet more companies come and go with their valiant attempts to crack the code for federated search. But until the real problems with federation are truly understood, be prepared for more tail wagging. Ironically enough however, it appears to me a solution might soon be launched...right under our noses...hmmm. As always, stay tuned!
