10-09-2008, 02:34 AM
Cormac Moylan
Google not accepting robots.txt rules

One of my sites has had 141 pages indexed in Google in a little over a fortnight. The site uses a shopping cart application which hooks up to Amazon and displays Amazon listings.

The shopping cart is powered by Associate-O-Matic, which brings down a LOT of content from Amazon. I was concerned about duplicate content, so I set up some mod_rewrite rules and used robots.txt to restrict indexing of all URLs which contain a query string.

I tested this robots.txt file against a number of pages from my site via the Google Webmaster Console. Each and every time, the robots.txt analyzer said that the page is restricted.
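For anyone who wants to repeat that kind of check outside the Webmaster Console, here is a rough sketch in Python. The rules in `RULES` are hypothetical (my real robots.txt isn't shown here), and `pattern_to_regex` and `is_blocked` are illustrative names, not part of any library. It assumes Google-style matching, where `*` is a wildcard, a trailing `$` anchors the end of the URL, the longest matching rule wins, and a tie goes to Allow. As far as I know, Python's built-in `urllib.robotparser` does not handle wildcards, which is why the matching is done by hand.

```python
import re
from urllib.parse import urlsplit

# Hypothetical rules for illustration only; the real robots.txt is not shown.
RULES = [
    ("allow", "/$"),      # assumed: allow the home page itself
    ("disallow", "/*?"),  # block any URL that contains a query string
]


def pattern_to_regex(pattern):
    """Turn a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(url, rules=RULES):
    """Return True if the longest matching rule for this URL is a Disallow."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query

    best = None  # (pattern length, is_allow); Allow wins a tie on length
    for kind, pattern in rules:
        if pattern_to_regex(pattern).match(target):
            candidate = (len(pattern), kind == "allow")
            if best is None or candidate > best:
                best = candidate

    # No matching rule at all means the URL is not restricted.
    return best is not None and not best[1]


if __name__ == "__main__":
    for url in (
        "http://example.com/",
        "http://example.com/books/",
        "http://example.com/item.php?id=1",
    ):
        print(url, "->", "restricted" if is_blocked(url) else "allowed")
```

This only tells you what the rules say about crawling a URL. It says nothing about whether a search engine has the URL in its index, which is the part I can't explain.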

I permitted the inclusion of 14 entry pages via robots.txt and via a sitemap.xml file. These 14 entry pages are the only ones indexed in Yahoo.com. Yahoo has prevented indexing of the duplicated content (as it should do, well done Yahoo).

Google, on the other hand, has completely ignored the robots.txt file and has indexed over 100 pages of duplicate content which I said not to index.

In the Google Webmaster Console I have an alert stating that approximately 250 URLs are restricted by robots.txt. But a lot of those 250 URLs are appearing in Google's index.

I can't understand why Google is doing this. Yahoo is playing ball and correctly following my rules, but Google is lining me up for possible duplicate content issues further down the line.

Has anybody encountered issues similar to mine? I can't disclose the URL at this time as the site is a work in progress.