This is a discussion on Crawlers, robots and excessive bandwidth within the Server / Technical Administration Tips and Queries forums, part of the Webmaster Help category.
| ||||
Not sure if this is the correct board for this...
I host a client's site on a shared server with a 50 GB bandwidth allowance per month. At the moment Googlebot is using the entire 50 GB and more crawling the site. Total bandwidth for my shared server this month is about 130 GB, and most of this is robots crawling various sites.
I have a robots.txt file which only allows the major search engines and restricts access to all directories on the site. I have also used Google's webmaster tools to slow Googlebot's crawl rate, but nothing seems to be working.
Does anyone have any suggestions on how to dramatically slow Googlebot's crawling?
Thanks!
Barry
| |||||
Disallow your images directories?
Code:
    Disallow: /images/
50 GB seems excessive (but then, depending on the site ...)
| ||||
I have already disallowed all directories! Maybe the robots.txt file is not in the correct format. Here it is:
    # Allows only major search engines and known friendly spiders

    # Major Search Engines and Known Friendly Spiders (allowed)
    User-agent: Googlebot
    Disallow:
    Crawl-delay: 10
    Request-rate: 1/5
    Visit-time: 0800-1000

    User-agent: MSNBot
    Disallow:

    User-agent: Slurp
    Disallow:

    User-agent: Teoma
    Disallow:

    User-agent: Gigabot
    Disallow:

    User-agent: Scrubby
    Disallow:

    User-agent: Robozilla
    Disallow:

    # Everyone Else (NOT allowed)
    User-agent: *
    Disallow: /

    #disallow any legit search engines from crawling the following directories
    User-agent: *
    Disallow: /stock
    Disallow: /cgi-bin
    Disallow: /cp
    Disallow: /images
    Disallow: /albums
    Disallow: /tmp
    Disallow: /admin
    Disallow: /auth2
    Disallow: /auth1
    Disallow: /webalizer
    Disallow: /graphics
    Disallow: /covers
    Disallow: /dict
    Disallow: /dump
    Disallow: /gallery
    Disallow: /graphics
    Disallow: /guest
    Disallow: /includes
    Disallow: /mailing_list
    Disallow: /mp3
    Disallow: /newcovers
    Disallow: /order_form
    Disallow: /ordering
    Disallow: /photography
    Disallow: /pictures
    Disallow: /registration
    Disallow: /scene
    Disallow: /competition
    Disallow: /specials
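For anyone wanting to check how a file like the one above gets interpreted, here is a minimal sketch using Python's standard-library `urllib.robotparser`. It only approximates how a real crawler reads the file, the file is abbreviated, and the bot name `SomeOtherBot` and the path `/stock/basket.php` are made up for illustration. The point it tries to show: a crawler follows the group that names it, so Googlebot reads its own group (where an empty `Disallow:` means "allow everything") and never consults the `User-agent: *` groups.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the posted robots.txt (only a few of the
# directory rules are repeated here).
POSTED = """\
User-agent: Googlebot
Disallow:
Crawl-delay: 10
Request-rate: 1/5
Visit-time: 0800-1000

User-agent: MSNBot
Disallow:

# Everyone Else (NOT allowed)
User-agent: *
Disallow: /

User-agent: *
Disallow: /stock
Disallow: /cgi-bin
Disallow: /images
"""

parser = RobotFileParser()
parser.parse(POSTED.splitlines())

# Hypothetical URL inside the /stock directory.
url = "/stock/basket.php"

# Googlebot matches its own group, whose empty Disallow allows everything.
googlebot_stock = parser.can_fetch("Googlebot", url)
# A bot with no group of its own falls back to the "User-agent: *" rules.
other_bot_stock = parser.can_fetch("SomeOtherBot", url)

print("Googlebot may fetch", url, "->", googlebot_stock)
print("SomeOtherBot may fetch", url, "->", other_bot_stock)
```

If this parser's reading matches Google's, the `/stock` rules never apply to Googlebot as the file stands. As far as I know, Googlebot also does not act on `Crawl-delay`, and `Request-rate` and `Visit-time` are non-standard extensions that few crawlers honour, which would explain why those lines had no visible effect.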
| |||||
If the pages are PHP then you could use something like this: http://www.modem-help.freeserve.co.u...-block.php.txt Quote:
There's a pile more that should be in there. Also look into including a scraper/bad-bot trap to block scrapers. Keep an eye on this forum: Search Engine Spider Identification. Block all Chinese/Korean nets if necessary. Only allow users from the markets that the site is meant to serve if you really want to control access.
Regards...jmcc
| ||||
You'll have to excuse my ignorance but I can only see my logs on a webpage! Here is a link to a saved page with the 50 most recent entries: H-SPHERE roadrecs (basic unix)
I notice in one entry that Googlebot is crawling the /stock directory, but I have disallowed that in robots.txt with the following lines:
    User-agent: *
    Disallow: /stock
    Disallow: /cgi-bin
    Disallow: /cp
    Disallow: /images
    Disallow: /albums
    Disallow: /tmp
    Disallow: /admin
and more.....
The file name is 'robots.txt' and it is in the root folder, but maybe I am doing something wrong with it, as I am sure Google obeys robots.txt?
I haven't had a chance to investigate John's suggestions above yet but hope to get around to it asap. Thanks for the help.
barry
| |||||
It may be that the bots are downloading the entire database repeatedly over the month. (Each PHP page is effectively a new page to most bots, especially if there is any date-type information included.) Google is not too bad and understands 304 (unchanged) responses. Yahoo's Slurp is a bit screwed up. Microsoft's bot is so far beyond screwed up that it is in another universe. The bots should not be accessing the shopping cart. The revised robots.txt with the images/shopping etc. added may make a difference. However, it is limited by how often the bots recheck robots.txt.
Regards...jmcc
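To illustrate the 304 point: a page only saves bandwidth on a re-crawl if the server compares the crawler's `If-Modified-Since` header against the time the content last changed and answers 304 with no body. The sketch below shows just that decision in Python; the function name `status_for` and the sample dates are hypothetical, and a real site would do this in whatever serves the pages (PHP here).

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def status_for(last_modified, if_modified_since):
    """Return 304 if the client's cached copy is still current, else 200."""
    if not if_modified_since:
        return 200  # no conditional header: send the full page
    try:
        cached = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable header: ignore it and send the page
    if cached.tzinfo is None:
        cached = cached.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds.
    if last_modified.replace(microsecond=0) <= cached:
        return 304
    return 200


# Hypothetical example: the page last changed on 10 Jan 2008.
changed = datetime(2008, 1, 10, 12, 0, 0, tzinfo=timezone.utc)
header = format_datetime(changed, usegmt=True)  # what the bot sends back

print(status_for(changed, header))  # bot already has the current copy
print(status_for(changed, None))    # first visit, no header
```

This only helps if `last_modified` reflects when the underlying data changed. If every PHP page reports "now" as its modification time, each request looks like a new page, which is the situation described above.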
| |||||
Ran a quick DNS check on the two Googlebots from your logs and both seem genuine:
    66.249.65.101
    66.249.70.234
Check here: DNS Tools
One thing I noticed is that Googlebot and Slurp are following what are possibly empty links, like this one: Basket is empty page, Road Records Shopping Section
I had a problem a few months back with excessive crawling by Googlebot following empty links on an events calendar. Googlebot was smart enough to pull out after a few identical pages, but Slurp needed a bit more persuasion. See: fake googlebots gobbling my band width
Keep a check on your logs, see exactly what Google is spidering, and try to block it from spidering empty or irrelevant pages. You should also create an XML sitemap and include a link to it in your robots.txt. Have a look at the two below.
http://www.coslia.com/robots.txt
http://www.coslia.com/sitemap.xml
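The DNS check described above can be scripted. A common approach, sketched here in Python, is a reverse lookup on the IP, a check that the hostname ends in `googlebot.com` or `google.com`, then a forward lookup to confirm the hostname maps back to the same IP. The function name `is_genuine_googlebot` is hypothetical, and the resolver functions are passed in as parameters only so the logic can be tried without network access.

```python
import socket


def is_genuine_googlebot(ip,
                         reverse=socket.gethostbyaddr,
                         forward=socket.gethostbyname_ex):
    """Reverse-resolve ip, check the domain, then forward-confirm it."""
    try:
        host = reverse(ip)[0]  # primary hostname for the IP
    except OSError:
        return False  # no reverse DNS record
    if not host.endswith((".googlebot.com", ".google.com")):
        return False  # hostname is not in a Google crawler domain
    try:
        addresses = forward(host)[2]  # list of IPs for that hostname
    except OSError:
        return False
    # A faked reverse record fails here: the name will not map back.
    return ip in addresses


if __name__ == "__main__":
    # Live lookup of the two addresses from the logs (needs network access).
    for ip in ("66.249.65.101", "66.249.70.234"):
        print(ip, is_genuine_googlebot(ip))
```

Checking the user-agent string alone is not enough, since anyone can send "Googlebot" in that header. The forward confirmation is what separates a real crawler from a scraper borrowing its name.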
| ||||
Thanks for both the replies and advice. All the catalogue and shopping cart functionality is in the /stock directory, so I have disallowed that in robots.txt and hopefully the next time Google checks it, it will obey it. I have also added <meta name="robots" content="noindex,nofollow" /> to the main catalogue/cart page within that directory, so hopefully that will take effect. I will keep track of the logs and see what effect this has.
cheers,
Barry
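Rather than waiting for the next crawl, a revised file can be sanity-checked locally. The sketch below uses Python's standard-library `urllib.robotparser`. The revised file shown is an assumption (the actual revised robots.txt was not posted), with the `/stock` rule placed inside the Googlebot group, and the helper `allowed` and the sample paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed revised file: the rules Googlebot should obey sit in its own group.
REVISED = """\
User-agent: Googlebot
Disallow: /stock
Disallow: /images

User-agent: *
Disallow: /
"""


def allowed(robots_txt, agent, path):
    """Return True if agent may fetch path under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# Hypothetical paths: one inside /stock, one ordinary page.
print(allowed(REVISED, "Googlebot", "/stock/basket.php"))
print(allowed(REVISED, "Googlebot", "/index.php"))
```

One caveat on the meta tag: a crawler has to download the page to read `noindex,nofollow`, so the tag affects indexing and link following but does not by itself save the bandwidth of fetching that page. The robots.txt rule is what stops the request.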