Crawlers, robots and excessive bandwidth

bazzjazz

New Member
Not sure if this is the correct board for this...

I host a client's site on a shared server with a 50GB bandwidth allowance per month. At the moment Googlebot is using the entire 50GB and more crawling the site. Total bandwidth for my shared server this month is about 130GB, and most of this is robots crawling various sites.

I have a robots.txt file that only allows the major search engines and restricts access to all directories on the site.

I have also used Google's Webmaster Tools to slow the crawling of Googlebot, but nothing seems to be working.

Does anyone have any suggestions on how to dramatically slow the crawling of Googlebot?

Thanks!

Barry
 

Forbairt

Teaching / Designing / Developing
Disallow your images directories?
Code:
Disallow: /images/

Disallow certain file extensions? .jpeg, .jpg, .avi and so on.
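Googlebot (and I think Slurp) understand wildcard patterns in robots.txt, so something like this should also catch extensions ... an untested sketch, adjust the list to suit:
Code:
User-agent: Googlebot
Disallow: /*.jpg$
Disallow: /*.jpeg$
Disallow: /*.gif$
Disallow: /*.avi$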

50GB seems excessive (but then it depends on the site ...)
 

bazzjazz

New Member
I have already disallowed all directories! Maybe the robots.txt file is not in the correct format. Here it is:

Code:
# Allows only major search engines and known friendly spiders
# Major Search Engines and Known Friendly Spiders (allowed)

User-agent: Googlebot
Disallow:
Crawl-delay: 10
Request-rate: 1/5
Visit-time: 0800-1000

User-agent: MSNBot
Disallow:

User-agent: Slurp
Disallow:

User-agent: Teoma
Disallow:

User-agent: Gigabot
Disallow:

User-agent: Scrubby
Disallow:

User-agent: Robozilla
Disallow:

# Everyone Else (NOT allowed)

User-agent: *
Disallow: /

#disallow any legit search engines from crawling the following directories

User-agent: *
Disallow: /stock
Disallow: /cgi-bin
Disallow: /cp
Disallow: /images
Disallow: /albums
Disallow: /tmp
Disallow: /admin
Disallow: /auth2
Disallow: /auth1
Disallow: /webalizer
Disallow: /graphics
Disallow: /covers
Disallow: /dict
Disallow: /dump
Disallow: /gallery
Disallow: /graphics
Disallow: /guest
Disallow: /includes
Disallow: /mailing_list
Disallow: /mp3
Disallow: /newcovers
Disallow: /order_form
Disallow: /ordering
Disallow: /photography
Disallow: /pictures
Disallow: /registration
Disallow: /scene
Disallow: /competition
Disallow: /specials
 

jmcc

Active Member
If the pages are PHP then you could use something like this:
http://www.modem-help.freeserve.co.uk/download/bot-block.php.txt
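If that script is anything like the usual bot blockers, it checks the request's user agent against a list of known bad bots before the page is served. A rough, untested sketch of the same idea (the bot names here are only placeholders, the linked script has a proper list):
Code:
<?php
// Deny requests whose user agent matches a (placeholder) bad-bot list.
// Include this at the very top of each PHP page, before any output.
$badBots = array('Scrubby', 'HTTrack', 'WebCopier', 'larbin');
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
foreach ($badBots as $bot) {
    if (stripos($ua, $bot) !== false) {
        header('HTTP/1.1 403 Forbidden');
        exit('Forbidden');
    }
}
?>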

User-agent: Teoma
Not really a player any more.

User-agent: Gigabot
A waste of bandwidth.

User-agent: Scrubby
Never heard of this one. Sounds like a maggot/scraper.

User-agent: Robozilla
Can't remember seeing this as a legit spider user agent.

There's a pile more that should be in there. Also look into including a scraper/bad-bot trap to block scrapers.
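A simple version of that kind of trap, as a sketch (the /trap/ path and file names are made up): disallow a dummy directory in robots.txt with "Disallow: /trap/", link to it from somewhere a human would never click, and log or ban anything that requests it anyway.
Code:
<?php
// Hypothetical /trap/index.php - only bots ignoring robots.txt should reach this.
$ip = $_SERVER['REMOTE_ADDR'];
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'unknown';
// Append the offender to a log that can later be turned into .htaccess deny rules.
file_put_contents(dirname(__FILE__) . '/bad-bots.log',
    date('Y-m-d H:i:s') . " $ip $ua\n", FILE_APPEND);
header('HTTP/1.1 403 Forbidden');
exit;
?>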

Keep an eye on this forum: Search Engine Spider Identification

Block all Chinese/Korean nets if necessary. Only allow users from the markets that the site is meant to serve if you really want to control access.

Regards...jmcc
 

bazzjazz

New Member
You'll have to excuse my ignorance but I can only see my logs on a webpage!

Here is a link to a saved page with the 50 most recent entries:
H-SPHERE roadrecs (basic unix)

I notice in one entry that Googlebot is crawling the /stock directory, but I have disallowed that in robots.txt with the following lines:

User-agent: *
Disallow: /stock
Disallow: /cgi-bin
Disallow: /cp
Disallow: /images
Disallow: /albums
Disallow: /tmp
Disallow: /admin

and more.....

The file is named 'robots.txt' and it is in the root folder, but maybe I am doing something wrong with it, as I am sure Google obeys robots.txt?

I haven't had a chance to investigate John's suggestions above yet but hope to get around to it asap.

Thanks for the help.

barry
 

jmcc

Active Member
It may be that the bots are downloading the entire db repeatedly over the month. (Each PHP page is effectively a new page to most bots, especially if there is any date-type information included.) Google is not too bad and understands 304s (unchanged). Yahoo's Slurp is a bit screwed up. Microsoft's bot is so far beyond screwed up that it is in another universe. The bots should not be accessing the shopping cart. The revised robots.txt with the images/shopping etc. added may make a difference. However, it is limited by the number of times that the bots recheck robots.txt.
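One more thing worth checking, assuming the robots.txt posted above is still what is live: a bot only obeys the most specific User-agent group that matches it, so Googlebot follows the empty Disallow under "User-agent: Googlebot" (allow everything) and ignores the directory list under "User-agent: *" completely. The directory Disallows would need to be repeated inside each allowed bot's own group, roughly like this:
Code:
User-agent: Googlebot
Disallow: /stock
Disallow: /images
Disallow: /albums
# ...and the rest of the directory list...

User-agent: Slurp
Disallow: /stock
Disallow: /images
Disallow: /albums
# ...same list for each allowed bot...

User-agent: *
Disallow: /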

Regards...jmcc
 

ghost

New Member
Ran a quick DNS check on the two Googlebot IPs from your logs and both seem genuine:
66.249.65.101
66.249.70.234
Check here: DNS Tools
One thing I noticed is Googlebot and Slurp are following what are possibly empty links, like these:
Basket is empty page
Road Records Shopping Section
I had a problem a few months back with excessive crawling by Googlebot following empty links on an events calendar.
Googlebot was smart enough to pull out after a few identical pages but Slurp needed a bit more persuasion.
http://www.irishwebmasterforum.com/...2-fake-googlebots-gobbling-my-band-width.html
Keep a check on your logs to see exactly what Google is spidering, and try to block it from spidering empty or irrelevant pages.
You should also create an XML sitemap and include a link to it in your robots.txt; have a look at the two below.

http://www.coslia.com/robots.txt
http://www.coslia.com/sitemap.xml
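For what it's worth, the robots.txt side of that is a single line pointing at the file, e.g. "Sitemap: http://www.example.com/sitemap.xml", and a bare-bones sitemap only needs a handful of URL entries to start with (example.com is just a placeholder):
Code:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <changefreq>weekly</changefreq>
  </url>
</urlset>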
 

bazzjazz

New Member
Thanks for both the replies and advice.

All the catalogue and shopping cart functionality is in the /stock directory, so I have disallowed that in the robots.txt, and hopefully the next time Google checks it, it will obey it.

I have also added <meta name="robots" content="noindex,nofollow" /> to the main catalogue/cart page within that directory so hopefully that will take effect.

I will keep track of the logs and see what effect this has.

cheers,

Barry
 