Crawlers, robots and excessive bandwidth

bazzjazz

New Member
Not sure if this is the correct board for this...

I host a client's site on a shared server with a 50GB bandwidth allowance per month. At the moment Googlebot is using the entire 50GB and more crawling the site. Total bandwidth for my shared server this month is about 130GB, and most of this is robots crawling various sites.

I have a robots.txt file that only allows the major search engines and restricts access to all directories on the site.

I have also used Google's Webmaster Tools to slow the crawling of Googlebot, but nothing seems to be working.

Does anyone have any suggestions on how to dramatically slow the crawling of Googlebot?

Thanks!

Barry
 

Forbairt

Teaching / Designing / Developing
Disallow your images directories?
Code:
Disallow: /images/

Disallow certain file extensions? .jpeg, .jpg, .avi and so on.
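Googlebot (and I think Slurp) understand wildcard patterns in robots.txt, so something like this should also catch extensions ... an untested sketch, adjust the list to suit:
Code:
User-agent: Googlebot
Disallow: /*.jpg$
Disallow: /*.jpeg$
Disallow: /*.gif$
Disallow: /*.avi$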

50GB seems excessive (but then it depends on the site ...)
 

bazzjazz

New Member
I have already disallowed all directories! Maybe the robots.txt file is not in the correct format. Here it is:

Code:
# Allows only major search engines and known friendly spiders
# Major Search Engines and Known Friendly Spiders (allowed)

User-agent: Googlebot
Disallow:
Crawl-delay: 10
Request-rate: 1/5
Visit-time: 0800-1000

User-agent: MSNBot
Disallow:

User-agent: Slurp
Disallow:

User-agent: Teoma
Disallow:

User-agent: Gigabot
Disallow:

User-agent: Scrubby
Disallow:

User-agent: Robozilla
Disallow:

# Everyone Else (NOT allowed)

User-agent: *
Disallow: /

#disallow any legit search engines from crawling the following directories

User-agent: *
Disallow: /stock
Disallow: /cgi-bin
Disallow: /cp
Disallow: /images
Disallow: /albums
Disallow: /tmp
Disallow: /admin
Disallow: /auth2
Disallow: /auth1
Disallow: /webalizer
Disallow: /graphics
Disallow: /covers
Disallow: /dict
Disallow: /dump
Disallow: /gallery
Disallow: /graphics
Disallow: /guest
Disallow: /includes
Disallow: /mailing_list
Disallow: /mp3
Disallow: /newcovers
Disallow: /order_form
Disallow: /ordering
Disallow: /photography
Disallow: /pictures
Disallow: /registration
Disallow: /scene
Disallow: /competition
Disallow: /specials
 

jmcc

Active Member
If the pages are PHP then you could use something like this:
http://www.modem-help.freeserve.co.uk/download/bot-block.php.txt
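If that script is anything like the usual bot blockers, it checks the request's user agent against a list of known bad bots before the page is served. A rough, untested sketch of the same idea (the bot names here are only placeholders, the linked script has a proper list):
Code:
<?php
// Deny requests whose user agent matches a (placeholder) bad-bot list.
// Include this at the very top of each PHP page, before any output.
$badBots = array('Scrubby', 'HTTrack', 'WebCopier', 'larbin');
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
foreach ($badBots as $bot) {
    if (stripos($ua, $bot) !== false) {
        header('HTTP/1.1 403 Forbidden');
        exit('Forbidden');
    }
}
?>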

User-agent: Teoma
Not really a player any more.

User-agent: Gigabot
A waste of bandwidth.

User-agent: Scrubby
Never heard of this one. Sounds like a maggot/scraper.

User-agent: Robozilla
Can't remember seeing this as a legit spider user agent.

There's a pile more that should be in there. Also look into including a scraper/bad-bot trap to block scrapers.
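A simple version of that kind of trap, as a sketch (the /trap/ path and file names are made up): disallow a dummy directory in robots.txt with "Disallow: /trap/", link to it from somewhere a human would never click, and log or ban anything that requests it anyway.
Code:
<?php
// Hypothetical /trap/index.php - only bots ignoring robots.txt should reach this.
$ip = $_SERVER['REMOTE_ADDR'];
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'unknown';
// Append the offender to a log that can later be turned into .htaccess deny rules.
file_put_contents(dirname(__FILE__) . '/bad-bots.log',
    date('Y-m-d H:i:s') . " $ip $ua\n", FILE_APPEND);
header('HTTP/1.1 403 Forbidden');
exit;
?>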

Keep an eye on this forum: Search Engine Spider Identification

Block all Chinese/Korean nets if necessary. Only allow users from the markets that the site is meant to serve if you really want to control access.

Regards...jmcc
 

bazzjazz

New Member
You'll have to excuse my ignorance but I can only see my logs on a webpage!

Here is a link to a saved page with the 50 most recent entries:
H-SPHERE roadrecs (basic unix)

I notice in one entry that Googlebot is crawling the /stock directory, but I have disallowed that in robots.txt with the following lines:

User-agent: *
Disallow: /stock
Disallow: /cgi-bin
Disallow: /cp
Disallow: /images
Disallow: /albums
Disallow: /tmp
Disallow: /admin

and more.....

The file is named 'robots.txt' and it is in the root folder, but maybe I am doing something wrong with it, as I am sure Google obeys robots.txt?

I haven't had a chance to investigate John's suggestions above yet but hope to get around to it asap.

Thanks for the help.

barry
 

jmcc

Active Member
It may be that the bots are downloading the entire db repeatedly over the month. (Each PHP page is effectively a new page to most bots, especially if there is any date-type information included.) Google is not too bad and understands 304s (unchanged). Yahoo's Slurp is a bit screwed up. Microsoft's bot is so far beyond screwed up that it is in another universe. The bots should not be accessing the shopping cart. The revised robots.txt with the images/shopping etc. added may make a difference. However, it is limited by the number of times that the bots recheck robots.txt.
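One more thing worth checking, assuming the robots.txt posted above is still what is live: a bot only obeys the most specific User-agent group that matches it, so Googlebot follows the empty Disallow under "User-agent: Googlebot" (allow everything) and ignores the directory list under "User-agent: *" completely. The directory Disallows would need to be repeated inside each allowed bot's own group, roughly like this:
Code:
User-agent: Googlebot
Disallow: /stock
Disallow: /images
Disallow: /albums
# ...and the rest of the directory list...

User-agent: Slurp
Disallow: /stock
Disallow: /images
Disallow: /albums
# ...same list for each allowed bot...

User-agent: *
Disallow: /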

Regards...jmcc
 

ghost

New Member
Ran a quick DNS check on the two Googlebot IPs from your logs and both seem genuine:
66.249.65.101
66.249.70.234
Check here: DNS Tools
One thing I noticed is Googlebot and Slurp are following what are possibly empty links, like these:
Basket is empty page
Road Records Shopping Section
I had a problem a few months back with excessive crawling by Googlebot following empty links on an events calendar.
Googlebot was smart enough to pull out after a few identical pages but Slurp needed a bit more persuasion.
http://www.irishwebmasterforum.com/...2-fake-googlebots-gobbling-my-band-width.html
Keep a check on your logs to see exactly what Google is spidering, and try to block it from spidering empty or irrelevant pages.
You should also create an XML sitemap and include a link to it in your robots.txt; have a look at the two below.

http://www.coslia.com/robots.txt
http://www.coslia.com/sitemap.xml
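For what it's worth, the robots.txt side of that is a single line pointing at the file, e.g. "Sitemap: http://www.example.com/sitemap.xml", and a bare-bones sitemap only needs a handful of URL entries to start with (example.com is just a placeholder):
Code:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <changefreq>weekly</changefreq>
  </url>
</urlset>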
 

bazzjazz

New Member
Thanks for both the replies and advice.

All the catalogue and shopping cart functionality is in the /stock directory, so I have disallowed that in the robots.txt, and hopefully the next time Google checks it, it will obey it.

I have also added <meta name="robots" content="noindex,nofollow" /> to the main catalogue/cart page within that directory so hopefully that will take effect.

I will keep track of the logs and see what effect this has.

cheers,

Barry
 