Segmenting Spiders Might Help Control Them

Traditional web search engines maintain duplicate copies of the Internet. They accomplish this by deploying armies of spiders to attack websites at regular intervals, and those attacks consume your website's resources.

Crawl-delay directives, conditional requests, and similar techniques can minimize the load and keep spiders playing fair, but as a site owner you still have to decide how much of today's resources you are willing to allocate to support a possible search request tomorrow.
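
For illustration, here is a minimal Python sketch of two of those politeness techniques from the crawler's side: spacing requests out with a fixed delay, and using conditional GETs so unchanged pages are not re-downloaded. The URLs, the ten-second delay, and the in-memory cache are placeholders for this example, not a description of how any particular search engine actually behaves.

```python
import time
import urllib.error
import urllib.request

# A sketch of two common politeness techniques a spider can use: a fixed
# delay between requests, and conditional GETs so unchanged pages are not
# re-downloaded. Values below are placeholders, not any engine's real policy.
CRAWL_DELAY_SECONDS = 10
last_modified = {}  # url -> Last-Modified header from the previous fetch


def polite_fetch(url):
    """Fetch a URL, skipping the body if the server says it is unchanged."""
    request = urllib.request.Request(url)
    if url in last_modified:
        # Conditional GET: the server can answer 304 Not Modified instead
        # of resending the whole page.
        request.add_header("If-Modified-Since", last_modified[url])
    try:
        with urllib.request.urlopen(request) as response:
            if "Last-Modified" in response.headers:
                last_modified[url] = response.headers["Last-Modified"]
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None  # nothing changed since the last visit
        raise


if __name__ == "__main__":
    for page in ["http://example.com/", "http://example.com/about"]:
        body = polite_fetch(page)
        print(page, "unchanged" if body is None else f"{len(body)} bytes")
        time.sleep(CRAWL_DELAY_SECONDS)  # play fair: do not hammer the site
```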

This is where real-time search is different; services like OneRiot do not keep replicas of the complete Internet around just in case someone asks. They do not even crawl your website; instead, they fetch the specific pages people think are interesting right now. This means the resource load on your website is tiny and directly tied to real traffic patterns.

Grouping real-time search with traditional web search does not make sense, and it actually reduces your website's ability to get traffic today. I would like to propose a new identification scheme, one that clearly identifies the type of service accessing your website. That way website owners can throttle with knowledge rather than lumping all search engines into a single bucket.
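
To make the idea concrete, here is a rough Python sketch of what throttling with knowledge might look like on the site owner's side. It assumes a hypothetical convention in which a service declares a purpose token such as purpose=batch-index or purpose=realtime-fetch in its User-Agent string; the tokens, the categories, and the per-minute limits below are all invented for illustration, not an existing standard.

```python
import time
from collections import defaultdict

# Hypothetical purpose tokens a service might declare in its User-Agent,
# mapped to a per-minute request budget. The tokens and the limits are
# invented for this sketch, not an existing standard.
CATEGORY_LIMITS = {
    "batch-index": 10,      # traditional spiders copying the site for later
    "realtime-fetch": 120,  # real-time services fetching what is hot right now
    "unknown": 30,          # anything that does not identify its purpose
}

_recent_requests = defaultdict(list)  # category -> timestamps in the last minute


def classify(user_agent):
    """Map a User-Agent string to a category via its declared purpose token."""
    ua = user_agent.lower()
    for token in ("batch-index", "realtime-fetch"):
        if "purpose=" + token in ua:
            return token
    return "unknown"


def allow_request(user_agent, now=None):
    """Return True if this request fits within its category's budget."""
    now = time.time() if now is None else now
    category = classify(user_agent)
    window = [t for t in _recent_requests[category] if now - t < 60.0]
    if len(window) >= CATEGORY_LIMITS[category]:
        _recent_requests[category] = window
        return False  # over budget: a real server might answer 429 here
    window.append(now)
    _recent_requests[category] = window
    return True


if __name__ == "__main__":
    ua = "ExampleBot/1.0 (purpose=realtime-fetch; +http://example.com/bot)"
    print(classify(ua), allow_request(ua))
```

The point is not the specific numbers: it is that a batch indexer building a copy of your site for tomorrow and a real-time service chasing today's traffic get different budgets, instead of sharing one crawler bucket.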

Thoughts?

