7 Questions for Collecta

Collecta is the latest creation from Jack Moffitt, a pioneering Albuquerque-based open source developer who has spearheaded the move toward real-time experiences on the Web with XMPP.  The service has carved out a place for "curated search", harvesting items from across the Internet and displaying them mere moments after they're published online, with almost no latency.

He is also the author of the new book "Professional XMPP Programming with JavaScript and jQuery", and has given numerous talks on the opportunities and challenges inherent in managing real-time search.  Here are 7 Questions with Jack.

1. One of the most impressive features of Collecta is its ability to harness updates not only from the usual suspect microblogging platforms like Twitter and Identi.ca, but also from news sources and blog posts/comments, as well as through Flickr for images and video.  Describe how your back-end works. 

The main difference between the back-end at Collecta and those of most other search systems is that Collecta is almost entirely push-based. For example, instead of relying on a crawler, which visits web pages on a periodic basis, we have publishers notify us about content at the same time that they publish it to their own sites.

When Reuters sends a new story out to their own sites, they also send it directly to us, and that content shows up immediately in any matching searches.

Of course, not all publishers are ready for push notifications, so we have also built specialized polling systems that bootstrap those data sources. We've also worked with a number of publishers to develop and deploy their push systems. For example, we helped WordPress.com develop their XMPP infrastructure, which makes all WordPress.com-hosted blogs and blog comments available as a publish/subscribe stream.

The primary advantage of this design is that the latency between the time someone publishes content and the time we see it is very nearly zero. We typically measure latencies of a third of a second or less on push sources.
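To make the push model concrete, here is a minimal Python sketch. It is an illustration only and not Collecta's implementation: the `PushHub` class, its method names, and the single-keyword matching are all assumptions made for the example. Standing searches register a callback, and each published item is handed to the matching callbacks immediately, along with the measured delay since publication.

```python
import re
import time
from collections import defaultdict


class PushHub:
    """Toy in-memory hub (hypothetical): delivers items to standing searches on publish."""

    def __init__(self):
        # keyword -> callbacks registered by standing searches
        self._subscribers = defaultdict(list)

    def subscribe(self, keyword, callback):
        """Register callback(text, latency_seconds) for a single keyword."""
        self._subscribers[keyword.lower()].append(callback)

    def publish(self, text, published_at=None):
        """Push one item to every matching search; return the number of deliveries."""
        if published_at is None:
            published_at = time.monotonic()
        words = set(re.findall(r"\w+", text.lower()))
        delivered = 0
        for word in words:
            # .get avoids creating empty entries in the defaultdict
            for callback in self._subscribers.get(word, ()):
                latency = time.monotonic() - published_at
                callback(text, latency)
                delivered += 1
        return delivered


hub = PushHub()
hub.subscribe("flood", lambda text, latency: print(f"{latency:.6f}s  {text}"))
hub.publish("Flood closes Highway 14")   # delivered to the "flood" search at once
hub.publish("Sunny skies all week")      # no standing search matches
```

A polling design would instead revisit each source on a timer, so the worst-case delay would be roughly one polling interval; in the push sketch the delay is only the time spent matching and dispatching.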

2. What have some of the challenges been in managing so much inbound data?

The biggest challenge is getting access to the data. Since few publishers have working push systems today, we must work with them to develop such systems. Generally they are quite eager to do so, as they know that they need to provide their data ever faster in today's world. We may be the first to be knocking on these doors, but it is easy to see that there will be a lot of others after us wanting this data.

The next challenge is processing the data as fast as possible. In traditional search, relevancy is improved over time. For example, it may take several weeks for a new Wikipedia article to get enough incoming links to start showing up as very relevant. In real-time search, we can't afford to measure relevancy this way. One side-effect of this is that all the algorithms involved must be well tuned so that they don't introduce any delays. There is no buffer zone for processing.

3. I see Collecta being valuable to people in three distinct types of situations: for scheduled events with global importance that people can anticipate and plan for (e.g., Christmas, the Super Bowl, the Emmy Awards); for breaking/developing items that people can't predict or didn't expect (e.g., celebrity deaths, national tragedies); and also for the general entertainment value in seeing things pop up as they happen.  How have people told you they've used your system?

These categories match up well with several use cases we've heard about. Here are some examples:

One of our developers used Collecta to find information on local road conditions that was unavailable through other means. This got his wife to her destination on time by alerting her to closed roads and flooded areas.

We've heard stories of people researching items they were thinking of purchasing and finding deals and coupon information that made their purchase cheaper. Imagine that you're thinking about buying my book, and you search for it on Collecta. Someone may have just been talking about how they just bought it at some specialty site for a discount. You can then take advantage of the same deal.

My in-laws are farmers in rural Minnesota. They used Collecta to find out about issues affecting the grain markets that weren't covered by their traditional news sources for several days. This allowed them to structure their financial transactions accordingly. You could imagine similar stories playing out for any market trader.

One of our internal use cases is watching how people respond to our own announcements and features. When we launched Collecta we were all glued to it, and there was no other way to watch how people reacted to our product. Before we launched, we used Collecta to monitor President Barack Obama's inauguration, and the photo stream in particular was stunning and much more interesting than what you could get from TV.

I think you've left out one big category, though: discovery. We call this thing real-time search, but the name is really inadequate. It's the closest analogy we could find that would give potential users some idea of what to expect. Once you've aggregated all the data, you have some real possibilities that aren't really search-related in the traditional sense. For example, when you visit Collecta.com lately, the page is filled with current, trending topics, all with the most relevant articles attached.

This category is more like the function of newspapers and nightly news. It gives you an overview of what is going on in the world, even if you aren't sure what you're looking for.

We also plan to expand this to determine trending information and relevant summaries within any search you do.

4. One concern that constantly pops up about real-time systems with regard to current events is the need to group data by authoritativeness, a la Google News, wherein more credible sources are assigned a greater weight based on the assumption that they're more accurate. Seeing as how reporting results as they happen is naturally linear, what are your thoughts?

In general, I think authority is a tricky subject. Most of the authority-based relevance systems are pretty bad; they often apply authority over a broad range of topics when a source may only be authoritative for a small set. This is not a limitation of technology, though. The same situation happens in the real world. Who is the authoritative source for information on healthcare reform in the U.S.? I think you'll find many people can't agree on this.

The way this is dealt with is by applying time. Eventually some source or set of sources will be deemed authoritative and this becomes the history we read about. The time scale here is quite large though, easily decades in some cases.

I don't think you can do a great job on authority alone, and I don't think it applies as broadly as people think. Finally, I don't see how you can resolve authority issues in a fraction of a second, when the only method I know of requires significant amounts of time.

The case may be different for a particular knowledge domain, but I still think even in these cases, authority is relative to a worldview.

5. Seeing as how we're in the embryonic stage of working with real-time information at a consumer level, many of the first-generation services are based on search and filtering of large data sets.  What other types of apps do you predict being popular as this model continues to evolve?

We're very interested in the idea of participation. Once you can inject your own content into the real-time flow, the world becomes a conversation, not a search. An example I often use is that the television network ABC would have a tough time getting all the people in the world to a single place to chat about its show Lost (not to mention getting them there at the same time). However, the Collecta search engine can aggregate every conversation about Lost in a single place. If you imagine a chat room interface where the other participants are made up of incoming search results, you can see where I'm going. You now have an Internet-wide chat room on any topic whatsoever, filtered and customized however you want.

Data aggregation is not easy, and I think it will enable a host of applications that depend upon access to large amounts of data. For example, TweetDeck, Brizzly, and the other social media clients currently each implement the Twitter API, the Facebook API, etc. to get data to their users. You can see this clearly in their UIs as well; each service is boxed by itself. Imagine how easy it might be to make an integrated social media client when access to the data is uniform. This is just a simple example, but there are more complex situations that data aggregation can enable.
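As a rough illustration of what "uniform access" could buy a client developer, the Python sketch below normalizes records from two services into one shared shape and merges them into a single timeline. Everything in it is an assumption made for the example: `service_a` and `service_b` are invented stand-ins, their record layouts are hypothetical and do not reflect the real Twitter or Facebook APIs, and `Item` is not a Collecta data type.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Item:
    """One shared shape for every source (hypothetical)."""
    source: str
    author: str
    text: str
    published: datetime


def from_service_a(record):
    # Hypothetical layout: {"user": str, "body": str, "epoch": seconds since 1970}
    return Item("service_a", record["user"], record["body"],
                datetime.fromtimestamp(record["epoch"], tz=timezone.utc))


def from_service_b(record):
    # Hypothetical layout: {"from": {"name": str}, "message": str, "created": ISO 8601}
    return Item("service_b", record["from"]["name"], record["message"],
                datetime.fromisoformat(record["created"]))


def merged_timeline(*streams):
    """Combine any number of normalized streams, newest item first."""
    items = [item for stream in streams for item in stream]
    return sorted(items, key=lambda item: item.published, reverse=True)


a_items = [from_service_a({"user": "ann", "body": "Watching Lost", "epoch": 1265025600})]
b_items = [from_service_b({"from": {"name": "Bob"}, "message": "Lost is back!",
                           "created": "2010-02-01T12:00:05+00:00"})]
for item in merged_timeline(a_items, b_items):
    print(item.published.isoformat(), item.source, item.author, item.text)
```

Once every source is reduced to the same shape, the client needs only one rendering path, which is why a single integrated view becomes straightforward instead of one box per service.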

6. Like any emerging technology, some enterprising folks will develop frameworks to abstract/automate a lot of the tedious minutiae of working with XMPP over HTTP - you've released Strophe both as a C and JavaScript library.  How do you see this maturing in the near future?

Eventually we will reach a point where web browsers don't need long polling and other similar hacks to do push-based or two-way communication. I don't know if this will be realized in the form of XMPP support in browsers or whether it will be in the form of the HTML5 WebSocket protocol or something entirely different.  On the client side, libraries like Strophe already make a lot of the tedium of the process invisible.

The server side is just as important, though. XMPP and WebSockets require more than just a simple web server. I think we will see growing sophistication in that space to match what we've seen with web application frameworks like Django, Ruby on Rails, and others.  Soon we will have XMPP application servers and WebSocket application servers, and frameworks like Rails and Django will seem as quaint as server-side includes and CGI scripts.

7. It would seem logical that you're going to get a lot of requests for custom app development based on the real-time paradigm for industries like media, politics and sports.  Have you, to this point?  (At the TV news station I work for, I'm planning to have Collecta running non-stop in our lobby on a large plasma HDTV as eye candy for visitors, with posts relevant to the local current events scene.)

Yes. As I said earlier, we talk to many of these publishers about getting them set up to push content instead of just making it available via pull methods. It doesn't take long for these publishers to turn the questions around and start asking us how they can leverage real-time in their own properties.

Sometimes they want easy, aggregated access to their own content, and sometimes they are intrigued by the possibility of integrating a view on everything else on the web.

We're developing several products that will make this easy for these publishers and for anyone else to build what they want. The first such products we launched were the XMPP API and HTTP API for Collecta search results. I can't name any names, but we give out API keys to well-known publishers all the time.

One of the common requests is for "curated search" products. The publishers want to control the queries that happen and just provide an updating view, or they want to restrict the input sources to a specific knowledge domain. You can imagine a sports version of Collecta where both the queries and the data sources were restricted or augmented to facilitate a focused experience. This is pretty much everyone's first application idea with our technology, and it's one we are really excited about facilitating.

You can see one of our early prototypes in this area, which we designed for the Obama inauguration. Our recent launch of MySpace site search is another example. The former is limited to a specific query on the full data set, and the latter is scoped to a specific data set but allows arbitrary queries.
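The two flavors of curated search described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Item` type, the function names, and the plain substring matching are invented for the example and say nothing about how Collecta actually matches or stores content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    source: str
    text: str


def matches(item, query):
    """Naive case-insensitive substring match (a stand-in for real search)."""
    return query.lower() in item.text.lower()


def fixed_query_view(items, query):
    """The publisher fixes the query; the full data set is searched."""
    return [item for item in items if matches(item, query)]


def scoped_search(items, allowed_sources, query):
    """The visitor may ask anything, but only the allowed sources are searched."""
    allowed = set(allowed_sources)
    return [item for item in items
            if item.source in allowed and matches(item, query)]


items = [
    Item("news", "Crowds gather for the inauguration"),
    Item("photos", "Inauguration parade on the Mall"),
    Item("social", "New band page is live"),
]

# Event page: one fixed query over everything.
print(fixed_query_view(items, "inauguration"))

# Site search: arbitrary queries, restricted to a single source.
print(scoped_search(items, {"social"}, "band"))
```

A sports-only edition would combine the two: restrict the sources to sports publishers and supply a fixed or augmented set of queries.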

There is a lot of unexplored ground here, which is what makes the work extremely fun.

Thanks Jack!  Great feedback and good luck with your work in the real-time search space! :)
