Are Search Engine Spiders / Robots Abusing Your Server?

Search engines are the source of a huge chunk of Photics.com’s visitor traffic. Someone in France, Poland or Bahrain will use a search engine (like Google, Yahoo or MSN) and may find my website. That’s impressive to me. I wasn’t even sure where Bahrain was located. But through the power of search engines, my website was being viewed out there… AMAZING!

http://en.wikipedia.org/wiki/Bahrain

That’s why I allow search engine robots to index my site. I know it will bring additional traffic from visitors all over the world. At almost any time of the day, I can find the MSNBot Spider, the Yahoo! Slurp Spider or Googlebot “crawling” my website. Machines are going through the site, reading it systematically. The pages are sorted by the search engines and judged for keyword relevancy. Someone in the Czech Republic might be looking for Guild Wars content. A search engine might decide that my website is a match. Almost magically, Photics.com has a new visitor.

The problem is that the search engine robots can get out of control. Here’s a statistic about Photics.com website traffic…

Most users ever online was 187, 05-03-2008 at 12:03 PM.

That statement is somewhat misleading for two reasons. First, the most users ever online was actually 701, back in early January of 2007, thanks to a popular link on DIGG. Second, and more relevant here, is the real problem with the 187 number – search engine spiders. While 187 looks impressive, those weren’t all visits from real people.

When the Photics Forum was relaunched, with new links and a new sitemap, Yahoo! Slurp started sucking up bandwidth like it was a chocolate milkshake. I found this activity amusing, but if it got out of control, it could take down the server – similar to a denial-of-service (DoS) attack.

What is a webmaster supposed to do? Should you block the spiders, but lose potential customers? Should you let the spiders roam free, but run the risk of them taking out your servers? If the server is robust, this may not be an issue. But if concurrent visitors and bandwidth limits are a concern, this could become a source of anxiety for webmasters. Fortunately, the answer doesn’t have to be all or nothing. There may be ways to control how the search engine spiders interact with your website.

The first thing to do is to create a robots.txt file and put it in the root directory of your website. With this file, you can set some rules for the robots.

User-agent: *
Disallow: /

If you’ve simply had enough of the robots, a command like that tells all of the search engine spiders to stay away. The asterisk in the User-agent line means the rule applies to every robot. The slash after Disallow covers every file and folder from the root directory down.
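
If only one spider is giving you grief, the ban doesn’t have to be global. Naming a specific robot in the User-agent line limits the rule to that robot alone. As a sketch, something like this would shut out just MSNBot while leaving every other spider free to crawl…

User-agent: msnbot
Disallow: /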

Usually, though, the goal is to control the speed at which the robots crawl your website, assuming that the desired speed is not zero. Here’s an alternative set of instructions for the robots.txt file…

User-agent: msnbot
Crawl-delay: 60

That tells msnbot to slow down, limiting it to one request every 60 seconds. The new instructions may not be acknowledged for a couple of days, so it may take some time to see results. Spiders are supposed to be sensitive to the speed of a web server, slowing themselves down if the server is overwhelmed, but they may not be able to judge the appropriate download rate on their own. The Crawl-delay command lets you take some control. I use the word “some” because not all of the search engine spiders respect it.

For example, Google ignores your crawl-delay request.
http://www.google.com/support/webmasters/bin/answer.py?answer=35239
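
Since Googlebot won’t slow down on request, one workaround is to use Disallow rules to keep it out of the heaviest parts of your site. This is only a sketch, and the directory names below are hypothetical; substitute whatever sections of your own site actually eat the bandwidth…

User-agent: Googlebot
Disallow: /gallery/
Disallow: /downloads/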

Unfortunately, the robots.txt file may not be honored, or some of its instructions might be ignored, but it is a good place to start. To gain additional control over how the search engines interact with your site, you might have to use their own tools.

http://www.google.com/webmasters/tools/

Once you verify your site with Google, additional tools are unlocked. With access to the Google Webmaster Tools, you can slow down the Googlebot…

Dashboard > Tools > Set crawl rate

There are other interesting tools available through the Google Webmaster website. Yahoo also has its Site Explorer tool for webmasters, but I don’t think it’s as robust as the Google Webmaster Tools. Yahoo does seem to honor the Crawl-delay command, though. So if Yahoo! Slurp is giving you trouble, the robots.txt file might be something to try.
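
As a starting point, a record like this should do the trick. Yahoo’s spider answers to “Slurp” in robots.txt, and the 60-second delay is just an arbitrary value to adjust as needed…

User-agent: Slurp
Crawl-delay: 60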