The real-time web is getting hotter every day. Web properties like Twitter and Facebook are leading the charge, and while these companies deserve the credit, the infrastructure for real time has been quietly building momentum over the past 20 years. Things like GPS, Doppler radar, design patterns, blackboard systems, video on demand, networking, multi-threaded applications, sensors, and multi-core processors are really at the heart of this overnight revelation.
Real Time Search
Searching the real-time Web is quickly becoming a big issue. If we look to the leaders in the space, Google's Matt Cutts, for example, talks about Google's approach to Twitter in the video below.
People are starting to wonder whether the concepts of universal web search can be extended to incorporate the signals of the real-time web. Before diving into that question, let's review what real time actually means, some of the signals being exposed by the real-time Web, and how they might be captured.
Real-time computing is not about processing things that happen now; it is about operational deadlines and predictability. Something that changes once an hour can still be considered real time if one hour is its defined deadline. Not what you expected, I bet, but how often things change is an interesting angle from which to look at the real-time Web. The graph below breaks real-time web data/signals into three categories and maps whether the data should be captured implicitly or explicitly.
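The deadline-based definition above can be sketched in a few lines of code. This is a minimal illustration, not anyone's actual implementation: the class and source names are invented here. The point is that "real time" is a property of meeting a defined deadline, so an hourly feed and a one-second ticker are both real time on their own terms.

```python
class RealTimeSource:
    """A data source is 'real time' while updates arrive within its deadline."""

    def __init__(self, name, deadline_seconds):
        self.name = name
        self.deadline = deadline_seconds
        self.last_update = None  # timestamp of the most recent update

    def update(self, now):
        self.last_update = now

    def meets_deadline(self, now):
        # Real time means predictably hitting the deadline,
        # not raw speed: an hourly deadline is as valid as a 1s one.
        if self.last_update is None:
            return False
        return (now - self.last_update) <= self.deadline


# Hypothetical sources with very different, equally valid deadlines.
stock_ticker = RealTimeSource("stock ticker", deadline_seconds=1)
weather_feed = RealTimeSource("weather feed", deadline_seconds=3600)

t0 = 0
weather_feed.update(t0)
# Thirty minutes later, the hourly feed still meets its deadline,
# while the never-updated one-second ticker does not.
print(weather_feed.meets_deadline(t0 + 1800))   # True
print(stock_ticker.meets_deadline(t0 + 1800))   # False
```

Seen this way, the interesting question is not "how fast?" but "what deadline does each signal carry, and is it being met?"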
Part 1 of 2.
Traditional web search engines maintain duplicate copies of the Internet. They accomplish this by deploying armies of spiders to attack websites at regular intervals. These attacks consume your website's resources.
Techniques and algorithms can be used to minimize load and play fair, but as a site owner you have to decide how much of today’s resources you are willing to allocate to support a possible request tomorrow.
This is where real-time search is different: services like OneRiot do not keep replicas of the complete Internet around just in case someone asks. They do not even crawl your website; instead, they fetch specific pages people think are interesting now. This means the resource load on your website is tiny and directly tied to real traffic patterns.
Grouping real-time search with traditional web search does not make sense, and it actually limits your website's ability to get traffic today. I would like to propose a new identification scheme, one that clearly identifies the type of service accessing your website. This way website owners can throttle with knowledge rather than lumping all search engines into a single bucket.
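To make the proposal concrete, here is one possible sketch of such a scheme. Everything in it is hypothetical: the `realtime-fetcher` token, the delay values, and the classification rules are invented for illustration; no such standard exists today. The idea is simply that if a service declared its type in its User-Agent string, a site owner could throttle full-site spiders hard while letting traffic-driven single-page fetches through.

```python
# Hypothetical crawl-delay policy per service type (seconds between requests).
CRAWL_DELAYS = {
    "batch-crawler": 30.0,     # full-site spiders: heavy load, throttle hard
    "realtime-fetcher": 0.0,   # single-page fetches driven by live traffic
}


def classify(user_agent):
    """Guess the type of service from its User-Agent string."""
    ua = user_agent.lower()
    # An explicit token (invented here) would remove the guesswork.
    if "realtime-fetcher" in ua:
        return "realtime-fetcher"
    # Fall back to common spider naming conventions.
    if any(token in ua for token in ("bot", "spider", "crawler")):
        return "batch-crawler"
    return "browser"


def crawl_delay(user_agent):
    """Throttle based on service type; regular browsers are never delayed."""
    return CRAWL_DELAYS.get(classify(user_agent), 0.0)


print(crawl_delay("ExampleBot/2.1 (+http://example.com/bot)"))  # 30.0
print(crawl_delay("ExampleFetch/1.0 realtime-fetcher"))         # 0.0
```

With labels like these, throttling decisions could reflect what a service actually costs the site, which is exactly the knowledge the single "search engine" bucket throws away.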