Scraping makes a lot of sense when dealing with the content recycling problem. By content recycling, I mean republishing content that is no longer available or searchable on Google because it has been removed from Google's index. The procedure is to pick up websites from archives such as archive.org, check whether their content is in Google's index and, if it is not, publish it somewhere and submit it to Google's index (by any of the many ways we know). Without that index check, scraped content that already exists on Google is simply treated as duplicate content and penalized.
The problem I have been given is to algorithmically write content for the Search Engine Optimization of a travel portal (name confidential). Simple scraping is obviously not good enough, mainly because of the newly launched Google changes known as BigDaddy. The major changes (as far as is believed) are:
- Google has started using a new 64-bit architecture
- There have been major updates to:
  - Google infrastructure
  - Google search algorithm
The very first absurd change is that Google now treats http://example.com, http://www.example.com and http://example.com/index.php (for example) as different URLs, and the copies of the page served at these URLs are treated as duplicate content. So while doing Search Engine Marketing, one needs to make sure that a single URL format is followed throughout, because Google will penalize the other URLs anyway, and problems will arise if the URL that received most of the SEM effort is the one that gets penalized. This is not good on Google's part, as Google is at least expected to understand these simple URL conventions. Still, I am pretty sure there must be some strong reason behind it. One reason I can see is the emergence of prefixes like www2: www2.example.com, for example, should be treated as a different website from www.example.com.
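To follow one URL format throughout, it helps to normalize every link before it is published or submitted. The following is only a minimal sketch of that idea in Python: the function name `canonical_url`, the `preferred_host` parameter and the list of default document names are my own assumptions for illustration, not anything Google prescribes.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: these file names are served as the default document of a directory.
DEFAULT_DOCUMENTS = {"index.php", "index.html", "index.htm"}


def canonical_url(url, preferred_host="www.example.com"):
    """Map equivalent spellings of a URL onto one preferred form (hypothetical helper)."""
    parts = urlsplit(url)
    host = parts.netloc.lower()

    # Work out the bare domain so example.com and www.example.com fold together.
    bare = preferred_host[4:] if preferred_host.startswith("www.") else preferred_host
    # Other prefixes such as www2.example.com are left alone: they may be different sites.
    if host in (bare, "www." + bare):
        host = preferred_host

    # An empty path and "/" are the same page.
    path = parts.path or "/"
    # Drop a trailing default document: /index.php becomes /.
    head, _, tail = path.rpartition("/")
    if tail.lower() in DEFAULT_DOCUMENTS:
        path = head + "/"

    # The fragment is dropped because it never reaches the server.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))
```

With this sketch, the three URLs from the example all map to http://www.example.com/, while a www2.example.com URL keeps its own host.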
Another, more intelligent change made by Google under BigDaddy rests on the basic presumption that thousands of pages can never be created overnight, so Google enforces a time penalty on URLs too. Before giving any PageRank to a new website, Google keeps its indexed pages sandboxed while it checks them against the time penalty and the content duplication penalty. Let me say something here about the content duplication penalty. Some optimizers scrape any content they can find and put it on their websites to increase the amount of content (more content == more value in Google). This may sometimes even penalize the site from which the content was copied, badly affecting its owners' hard work. The only way to deal with such content scrapers is to periodically search for content copied from your website, track the copies down and report them to Google as spam. A very legitimate issue that now concerns a lot of new websites is news shared by people on their blogs, forums, etc., since visitors and users want to see and discuss these articles in their favorite forums.
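The periodic search for copied content mentioned above needs some way to decide whether a suspect page really is a copy of yours. One common technique is to compare overlapping word sequences ("shingles") of the two texts. The sketch below is only an illustration of that technique: the function names and the shingle size of 5 words are my assumptions, and this is not a description of how Google or Copyscape detect duplicates.

```python
import re


def shingles(text, size=5):
    """Return the set of overlapping word sequences of the given size."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a, b, size=5):
    """Jaccard similarity of the two shingle sets: 0.0 (unrelated) to 1.0 (identical)."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

A score close to 1.0 suggests that the page was copied almost word for word. Where to draw the line for reporting a page is a judgment call that has to be tuned on real pages.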
One way to do scraping, as I was discussing, is "desert scraping", in which you scrape content that no longer exists in Google's index. One can verify on copyscape.com whether the content has already been copied elsewhere, and if it has not, this approach can actually result in content recycling in a very useful way. It is especially useful for optimizing long-tail search keywords. Content sources can easily be picked up from dead-links.com. Even so, the optimization and marketing problem has become very complex now, as some things in Google, like supplemental pages, seem to have no explanation at all.
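The desert scraping procedure boils down to a filter over candidate pages. Here is a minimal sketch of that filter. The two checks are passed in as plain functions because I do not want to pretend to know an official API for them: `is_indexed` and `is_copied` are hypothetical callables that you would back with a Google search and a Copyscape lookup respectively.

```python
from typing import Callable, Iterable, List


def recyclable_pages(
    candidates: Iterable[str],
    is_indexed: Callable[[str], bool],
    is_copied: Callable[[str], bool],
) -> List[str]:
    """Keep only the candidate URLs whose content is safe to recycle.

    is_indexed and is_copied are hypothetical checks supplied by the caller.
    """
    keep = []
    for url in candidates:
        # Content still in Google's index would count as duplicate content.
        if is_indexed(url):
            continue
        # Content already republished elsewhere is not worth recycling either.
        if is_copied(url):
            continue
        keep.append(url)
    return keep
```

Keeping the checks outside the function also makes it easy to try the filter with stub checks before wiring in any real lookups.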
I would really appreciate it if someone adds useful input via the comments.

Nice article…Anyways I think many of us know which travel portal you are referring to here…:D…
Is it? Great!! Anyway, that's my mentor's project, so I have to give it my best as well as the best available. I would really look forward to connecting in case you want to discuss it more. I think these are the basic things one needs to know before entering the online world dominated by Google.
Great article. Doesn't Google have the archive's content in its index as well? After all, the archive too may be treated by Google as just another website.
Very true. Google does have archive content indexed on its machines. But Google regularly keeps deleting pages from its index (through some algorithm, probably based on no hits, copied content or hundreds of other factors). The solution for scraping is that once we pick up websites from archive.org, we need to check whether the content is indexed by Google or not, through Copyscape or a Google search. Only then does it make sense to recycle the content.