Thanks... yeah, agreed, for the major sites listed, feeds would definitely be an easier option. I've been trying out a more generic approach that I can adapt to other websites that don't supply feeds. I've been using an open source project called
Nutch for the crawling, and so far it's been scaling pretty well. I've set up a small two-node cluster, and it uses map/reduce to carry out the work. I haven't really pushed it yet, so I might try adding some more sites and see how it gets on. Been thinking of adding DVDs in...