Irish SEO, Marketing & Webmaster Discussion

 

Crawlers, robots and excessive bandwidth

#1 - bazzjazz - 28-04-2008, 11:05 AM

Not sure if this is the correct board for this...

I host a client's site on a shared server with a 50 GB monthly bandwidth allowance. At the moment Googlebot is using the entire 50 GB and more crawling the site. Total bandwidth for my shared server this month is about 130 GB, and most of this is robots crawling various sites.

I have a robots.txt file which allows only the major search engines and restricts access to all directories on the site.

I have also used Google's Webmaster Tools to slow Googlebot's crawl rate, but nothing seems to be working.

Does anyone have any suggestions on how to dramatically slow Googlebot's crawling?

Thanks!

Barry
#2 - Forbairt - 28-04-2008, 12:17 PM

Disallow your image directories?
Code:
Disallow: /images/
Disallow certain file extensions too? .jpeg, .jpg, .avi and so on.

50 GB seems excessive (but then it depends on the site ...)
#3 - bazzjazz - 28-04-2008, 12:40 PM

I have already disallowed all directories! Maybe the robots.txt file is not in the correct format. Here it is:

# Allows only major search engines and known friendly spiders
# Major Search Engines and Known Friendly Spiders (allowed)

User-agent: Googlebot
Disallow:
Crawl-delay: 10
Request-rate: 1/5
Visit-time: 0800-1000

User-agent: MSNBot
Disallow:

User-agent: Slurp
Disallow:

User-agent: Teoma
Disallow:

User-agent: Gigabot
Disallow:

User-agent: Scrubby
Disallow:

User-agent: Robozilla
Disallow:

# Everyone Else (NOT allowed)

User-agent: *
Disallow: /

#disallow any legit search engines from crawling the following directories

User-agent: *
Disallow: /stock
Disallow: /cgi-bin
Disallow: /cp
Disallow: /images
Disallow: /albums
Disallow: /tmp
Disallow: /admin
Disallow: /auth2
Disallow: /auth1
Disallow: /webalizer
Disallow: /graphics
Disallow: /covers
Disallow: /dict
Disallow: /dump
Disallow: /gallery
Disallow: /graphics
Disallow: /guest
Disallow: /includes
Disallow: /mailing_list
Disallow: /mp3
Disallow: /newcovers
Disallow: /order_form
Disallow: /ordering
Disallow: /photography
Disallow: /pictures
Disallow: /registration
Disallow: /scene
Disallow: /competition
Disallow: /specials
#4 - jmcc - 28-04-2008, 12:55 PM

Quote:
Originally Posted by bazzjazz
Not sure if this is the correct board for this...
If the pages are PHP, then you could use something like this:
http://www.modem-help.freeserve.co.u...-block.php.txt

Quote:
User-agent: Teoma
Not really a player any more.

Quote:
User-agent: Gigabot
A waste of bandwidth.

Quote:
User-agent: Scrubby
Never heard of this one. Sounds like a maggot/scraper.

Quote:
User-agent: Robozilla
Can't remember seeing this as a legit spider user agent.

There's a pile more that should be in there. Also look into including a scraper/bad-bot trap to block scrapers.

Keep an eye on this forum:
Search Engine Spider Identification:

Block all Chinese/Korean nets if necessary. Only allow users from the markets that the site is meant to serve if you really want to control access.

Regards...jmcc
#5 - bazzjazz - 28-04-2008, 03:31 PM

Hey John,

Thanks for the info, I'll get on the case tomorrow to investigate further.

regards,

Barry
#6 - ghost - 29-04-2008, 12:04 PM - robots and crawlers

Quote:
Originally Posted by bazzjazz
Total bandwidth for my shared server this month is about 130 GB, and most of this is robots crawling various sites.
50 GB seems excessive. Could you post (attach) your server log for us to view?
#7 - bazzjazz - 30-04-2008, 11:12 AM

You'll have to excuse my ignorance but I can only see my logs on a webpage!

Here is a link to a saved page with 50 most recent entries:
H-SPHERE roadrecs (basic unix)

I notice in one entry that Googlebot is crawling the /stock directory, but I have disallowed that in robots.txt with the following lines:

User-agent: *
Disallow: /stock
Disallow: /cgi-bin
Disallow: /cp
Disallow: /images
Disallow: /albums
Disallow: /tmp
Disallow: /admin

and more.....

The file is named 'robots.txt' and it is in the root folder, but maybe I am doing something wrong with it, as I am sure Google obeys robots.txt?

I haven't had a chance to investigate John's suggestions above yet but hope to get around to it asap.

Thanks for the help.

barry
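A likely explanation for Googlebot still fetching /stock: under the robots exclusion convention a crawler follows only the most specific User-agent group that matches it. Googlebot therefore obeys its own group, whose empty "Disallow:" permits everything, and never reads the rules listed under "User-agent: *". The sketch below checks this with Python's standard urllib.robotparser on an abbreviated copy of the posted file. The sample paths and the agent name "SomeOtherBot" are made up for illustration, and Google's own parser may differ from Python's in details.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the robots.txt posted earlier in the thread.
ORIGINAL = """\
User-agent: Googlebot
Disallow:
Crawl-delay: 10

User-agent: *
Disallow: /

User-agent: *
Disallow: /stock
Disallow: /images
"""

# One possible fix (an assumption, not the poster's final file):
# repeat the directory rules inside the Googlebot group itself.
REVISED = """\
User-agent: Googlebot
Disallow: /stock
Disallow: /images

User-agent: *
Disallow: /
"""


def allowed(robots_txt: str, agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # Print the verdict for each file, agent and (hypothetical) path.
    for name, text in (("original", ORIGINAL), ("revised", REVISED)):
        for agent in ("Googlebot", "SomeOtherBot"):
            for path in ("/stock/basket.php", "/index.php"):
                print(name, agent, path, allowed(text, agent, path))
```

Separately, Google does not honour Crawl-delay, and Request-rate and Visit-time are non-standard extensions that most major crawlers appear to ignore, so the crawl-rate setting in Webmaster Tools remains the way to slow Googlebot itself.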
#8 - jmcc - 30-04-2008, 12:07 PM

It may be that the bots are downloading the entire database repeatedly over the month. (Each PHP page is effectively a new page to most bots, especially if it includes any date-type information.) Google is not too bad and understands 304 (Not Modified) responses. Yahoo's Slurp is a bit screwed up. Microsoft's bot is so far beyond screwed up that it is in another universe. The bots should not be accessing the shopping cart. The revised robots.txt, with the images/shopping directories etc. added, may make a difference. However, its effect is limited by how often the bots recheck robots.txt.

Regards...jmcc
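The point about 304s can be made concrete. If a dynamic page sends a stable Last-Modified header and answers If-Modified-Since requests with 304 Not Modified, a well-behaved bot can re-check the page without downloading the body again. A minimal sketch of that decision follows, written in Python for illustration only (the site in question is PHP). The function name conditional_response and the idea of taking last_modified from the underlying database record are assumptions, not something stated in the thread.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Dict, Optional, Tuple


def conditional_response(
    if_modified_since: Optional[str], last_modified: datetime
) -> Tuple[int, Dict[str, str]]:
    """Choose between 200 and 304 for a page last changed at `last_modified`.

    `last_modified` must be a timezone-aware datetime.
    `if_modified_since` is the raw request header value, or None if absent.
    """
    # HTTP dates have one-second resolution, so drop microseconds first.
    last_modified = last_modified.astimezone(timezone.utc).replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # Unparseable header: fall through to a full response.
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            if last_modified <= since:
                return 304, headers  # Not Modified: headers only, no body.

    return 200, headers  # Send the full page.
```

The saving only materialises if the Last-Modified value is tied to the data rather than to the time the page was generated. A page that stamps itself with the current time on every request will always look new.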
#9 - ghost - 30-04-2008, 12:49 PM

I ran a quick DNS check on the two Googlebot IPs from your logs and both seem genuine:
66.249.65.101
66.249.70.234
Check here DNS Tools
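This kind of check can also be scripted. The method Google documents for verifying Googlebot is a reverse DNS lookup on the IP, a check that the resulting hostname ends in googlebot.com or google.com, and then a forward lookup to confirm that the hostname resolves back to the same IP. A minimal sketch, with the resolver functions passed in as parameters so the logic can be tried without network access (all function and parameter names here are illustrative):

```python
import socket
from typing import Callable, List

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_lookup(ip: str) -> str:
    """Reverse DNS: IP address -> hostname (raises OSError on failure)."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> List[str]:
    """Forward DNS: hostname -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(hostname)[2]


def is_genuine_googlebot(
    ip: str,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """Return True if `ip` passes the reverse-then-forward DNS check."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
        if not hostname.endswith(GOOGLE_SUFFIXES):
            return False  # PTR record points somewhere other than Google.
        # A spoofed PTR record fails here: the name will not map back to the IP.
        return ip in forward(hostname)
    except OSError:
        return False  # Lookup failed: treat the client as unverified.
```

Calling is_genuine_googlebot("66.249.65.101") with the default resolvers performs live DNS lookups, so the result depends on the network at the time it is run.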
One thing I noticed is that Googlebot and Slurp are following what are possibly empty links, like these:
Basket is empty page
Road Records Shopping Section
I had a problem a few months back with excessive crawling by Googlebot following empty links on an events calendar. Googlebot was smart enough to pull out after a few identical pages, but Slurp needed a bit more persuasion:
fake googlebots gobbling my band width
Keep a check on your logs to see exactly what Google is spidering, and try to block it from spidering empty or irrelevant pages. You should also create an XML sitemap and include a link to it in your robots.txt. Have a look at the two below.

http://www.coslia.com/robots.txt
http://www.coslia.com/sitemap.xml
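To see what the crawlers are actually fetching, as suggested above, the access log can be tallied by bot and by path. A rough sketch, assuming the log is in Apache's "combined" format (the thread does not say which format the H-Sphere panel exposes). The sample lines are made up for illustration, and the list of bot names is only a starting point.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Apache "combined": host ident user [time] "request" status bytes "referer" "agent"
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

BOTS = ("Googlebot", "Slurp", "msnbot")  # substrings to look for in the agent

# Hypothetical log lines, not taken from the poster's server.
SAMPLE = [
    '66.249.65.101 - - [30/Apr/2008:10:00:00 +0100] "GET /stock/basket.php?id=1 HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.70.234 - - [30/Apr/2008:10:00:05 +0100] "GET /stock/basket.php?id=2 HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.70.234 - - [30/Apr/2008:10:00:09 +0100] "GET /index.php HTTP/1.1" 304 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.7 - - [30/Apr/2008:10:01:00 +0100] "GET /stock/index.php HTTP/1.0" 200 4096 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"',
    '192.0.2.8 - - [30/Apr/2008:10:02:00 +0100] "GET /index.php HTTP/1.1" 200 9000 "-" "Mozilla/5.0 (Windows; U) Firefox/2.0"',
]


def tally(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Return (bytes per bot, hits per (bot, first path segment))."""
    bytes_by_bot: Counter = Counter()
    hits_by_path: Counter = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # Skip lines that are not in the expected format.
        agent = match.group("agent").lower()
        bot = next((b for b in BOTS if b.lower() in agent), None)
        if bot is None:
            continue  # Not one of the crawlers being tracked.
        size = match.group("size")
        bytes_by_bot[bot] += 0 if size == "-" else int(size)
        # Reduce "/stock/basket.php?id=1" to its first segment, "/stock".
        first = match.group("path").lstrip("/").split("/", 1)[0].split("?", 1)[0]
        hits_by_path[(bot, "/" + first)] += 1
    return bytes_by_bot, hits_by_path


if __name__ == "__main__":
    total_bytes, hits = tally(SAMPLE)
    print(total_bytes.most_common())
    print(hits.most_common())
```

Matching on the user-agent string alone counts fake Googlebots as well, so it is worth combining a tally like this with a DNS check on the heaviest IPs.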
#10 - bazzjazz - 30-04-2008, 03:36 PM

Thanks for both the replies and advice.

All the catalogue and shopping cart functionality is in the /stock directory, so I have disallowed that in robots.txt and hopefully the next time Google checks the file, it will obey it.

I have also added <meta name="robots" content="noindex,nofollow" /> to the main catalogue/cart page within that directory, so hopefully that will take effect.

I will keep track of the logs and see what effect this has.

cheers,

Barry
All times are GMT +1.

