20 August 2008 ~ Comments

The Debate Around Duplicate Content


There has been a lot of discussion and debate around duplicate content.  When I helped put together the Tips from the T-List website, for example, the whole idea behind using the WordPress platform was to consolidate all the content into one place and then re-direct readers back out to the original posts and websites.  By collaborating and consolidating the content, the Tips from the T-List site would in essence act as a portal that would generate more traffic than any single blog could by itself.  Although this seems like a good idea, there was a lot of concern around the duplication of content and whether such an approach would help or hinder the rankings of individual blog sites.  Some bloggers refused to add their RSS feed to the site for fear that their site rankings would be negatively affected.

Creating a site that re-uses content from many sources seems to me to be something that is actually quite commonplace.  News aggregator sites, for example, pull in RSS feeds from other news sites, re-use content from press releases, and run stories across multiple sites.  In the travel business, there are hundreds of sites that have licensed the Lonely Planet, Rough Guide, or Columbus Guide content for their own websites.  In these cases, the companies are paying a license fee to the original publishers to place duplicate content on their sites.

I wanted to find out what the implications are for duplicate content and how Google and other search engines actually treat duplicate content.  This is what I have discovered:

1. Only use content that is Creative Commons, CopyLeft, open source, or otherwise not copyrighted.  If you are going to use copyrighted material, be sure to follow the terms of the copyright exactly.  Although this seems obvious, it is important to remember that by default, all written content is copyrighted unless otherwise stated, and NOT the other way around.

2. When using duplicate content, try to rewrite the content or paraphrase the content in order to ensure that it is not exactly the same as the original.  You will probably still need to give the original author credit for the original work.

3. When aggregating RSS feeds, it is always best to notify the original author to let them know what you are doing.  This will avoid any uncomfortable questions about inappropriate usage later.  Be sure to link to the original post along with the article.
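One rough way to check how far a rewrite has moved from its source (point 2 above) is to compare overlapping word sequences, often called shingles.  The sketch below is only an illustration: the function names and the three-word shingle size are my own assumptions, and search engines do not disclose how they actually measure duplication.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def overlap(original, rewrite, size=3):
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    a, b = shingles(original, size), shingles(rewrite, size)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    source = "Only use content that is not copyrighted."
    paraphrase = "Stick to material that carries no copyright."
    print(f"overlap: {overlap(source, paraphrase):.2f}")
```

A score near 1.0 means the rewrite is close to a copy; a score near 0.0 means very little wording is shared.  Where to draw the line is a judgment call, not a known search engine threshold.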

Google clearly identifies what it considers to be duplicate content.  These guidelines can be found on Google's Help site.  According to Google, content is considered duplicate when:

content is deliberately duplicated across domains in an attempt to manipulate search engine rankings or win more traffic.

So my question is… if the author of the content has agreed to allow another site to post the content, and the poster has added a link back to the original article, does this count as duplicate?  If the intention is not malicious but rather benign, can Google tell the difference?

In the case of consolidating or aggregating blog posts into a single site (like the Tips from the T-List), here is what Google recommends:

Syndicate carefully: If you syndicate your content on other sites, Google will always show the version we think is most appropriate for users in each given search, which may or may not be the version you’d prefer. However, it is helpful to ensure that each site on which your content is syndicated includes a link back to your original article. You can also ask those who use your syndicated material to block the version on their sites with robots.txt.
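To make the robots.txt suggestion concrete, here is a minimal sketch using Python's standard `urllib.robotparser` module.  The `/syndicated/` path, the example.com URLs, and the `is_crawlable` helper are assumptions for illustration only; an aggregator would substitute wherever it actually stores the syndicated copies.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an aggregator that keeps syndicated copies
# under /syndicated/ (the path is an assumption for illustration).
ROBOTS_TXT = """\
User-agent: *
Disallow: /syndicated/
"""


def is_crawlable(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/syndicated/post-1",
        "https://example.com/original/post-1",
    ):
        print(url, is_crawlable(ROBOTS_TXT, url))
```

With a rule like this, crawlers that honour robots.txt skip the syndicated copies while the rest of the site stays open, so the original article remains the only crawlable version.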

I know that Darren Cronian has had experiences in the past with splog sites re-using his content, but does anyone have any real evidence that duplicate content issues have had a negative impact on their blog readership, traffic, reputation, or rankings?

SEM
  • JX
    Your point is bang on the button. I'm also starting to think that there are echoes everywhere. I do think duplication can be useful, for example where the data sits in another place that has a focus, and a Heads Up! may be appreciated because of the increasing decibels of datanoise, for example in a specialised FriendFeed room. However, dups do add to the growing main FF list.
    However, in the online travel space I have noticed that a couple of new blogs on the block are just kinda like a screenscrape and don't add to the conversation at all. It is a bit like a few friends having a chat where someone in the group doesn't say anything except yeah, oh yeah, and nods their head while taking notes, then goes away, writes about it, and makes out they came up with it all.
    Love your new blog design btw. @^@
  • Hi Stephen - good article on duplicate content. The question you asked was indirectly answered at SMX Advanced in Seattle (I attended in June 08). The consensus I picked up was that no you won't be penalized or hurt directly by legitimately using duplicate content in any of the applications you describe. Penalization by search engines is actually pretty rare and reserved for clear violations with 'intent' (it takes a lot of search engine resources to penalize when the algorithm can do it better naturally). I don't think using syndicated content is intent in the way you describe it.

    Google determines the original author of the duplicate content - or more correctly they 'try' to determine the original author. It's not something they disclose but one can surmise that they base this on a basic first-past-the-post rule. If someone has already had this content indexed on their website, it can be attributed to them. Every other 'version' is filtered at the query. Someone may point out that their duplicate content is indexed and is therefore 'working' - however Google filters duplication at the query (when the search is conducted). If the content is a duplicate and owned by someone else it will rarely make it into the search engine results - and I would say almost never high in results or for competitive searches.

    In an indirect way you may use up search engine resources when the bot indexes your site. So say Googlebot arrives at your site and databases your 1000 pages of duplicate content before it gets to your original and proprietary content. It may never get to all of your original content and this is probably why Google would prefer you block it (robots.txt) or otherwise direct them (with XML sitemaps, for example).

    I say use syndicated content when it makes sense - but not for your SEO goals. You want truly original content for SEO benefits. Be careful about making small changes to content and expecting it to be considered 'fresh' - the reality is that Google understands semantics and language very, very well and it would take a fairly significant effort to rewrite content to make it look 'new' again in the eyes of Google (Googly eyes?).

    (Free tip) If the syndicated content you are considering has never been indexed in any form, and you are the first to make it available to the bots, you can potentially become the original owner of this content in the eyes of the search engines.
  • Thanks for the insights Scott. It sounds to me that a good strategy from a syndication side would be to wait 24 to 48 hours after the original article has been written and indexed before syndicating it. This way, the original article will be considered the real one and attributed to them versus the one that gets posted on the aggregation site. Since the purpose, in this case, is not search engine rankings or placement but to provide a variety of content from contributors (all around a primary industry) into a unified source, it should work fine and keep everyone (including the search engines) happy.
  • Can we not call it a penalty? Where duplicate content is concerned, Google does not penalise sites in respect of their rankings in the search engines. Let’s pretend article 1 and article 2 are on two pieces of paper. Google looks at them together and identifies the level of duplication.

    No one knows what percentage of duplication activates this filter but let’s pretend it’s 80%.

    If article 2 has 80% duplication of article 1 then Google filters one of these pages from the search engine results. No one knows how it decides which one to filter out, but it’s speculated that it depends on how old the domain is, how much of an authority the site is, etc.

    Let’s pretend..

    I have a blog that is 8 months old, and Tips from the T-list is 18 months old and is treated as more of an authority because of the number of quality links pointing to it. Google is more likely to filter out my blog post, losing me traffic and potential revenue, than it is the Tips from the T-list blog.

    For me as a blogger that would be a problem.

    Rather than publishing the full content, why do you not publish an extract and link through to the full blog post?

    The Associated Press recently changed its policy on bloggers using its content because it was finding that its content was getting frozen out of Google: authority blogs like TechCrunch carried more weight in Google and the Associated Press was losing traffic.
  • So is it fair to say that in your example, where the blog is newer than the aggregate blog, the aggregate blog may affect the syndicated blog's viewership by ranking higher and with more authority than the blog being syndicated? If there is a link back to the original blog post, does that increase the syndicated blog's page rank?

    Regarding your comment about publishing full versus extracts; for the most part whether the feed is full or just partial has been left up to the author of the syndicated blog when they send their feed in. Some of the feeds are just titles, some are titles with descriptions, and some are full feeds. I suppose it really depends on the comfort level of the individual blog author and whether, like travel-rants.com, driving traffic to your blog directly affects your income from the blog. Commercial blogs (like travel-rants.com) definitely have different and legitimate concerns regarding re-use of content and sharing of traffic.

    Is that a fair statement or am I off-base here?
  • Sorry for the delayed response Stephen.

    Yes, I do think it depends on the owner of the blog. I’ve never categorised my blog as commercial but I suppose it is, considering I have ads on the blog. Losing revenue isn’t a problem for me because I don’t get paid out on the traffic I get; (at the moment) I am paid up front.

    Even so, I don’t want people ‘stealing’ my content that I have worked hard at writing, so I would never provide full-version on my RSS feed. I also find extract feed means more people click through to the blog.

    I’ve no need to visit TIPs at the moment because all of the content is on my RSS reader.

    In my opinion TIPs would be a better, more useful resource if people wrote unique content exclusively for TIPs that you could also feature in a future book. This would make it more of an authority and be a group effort. Maybe share AdSense revenue as compensation for writing content for the blog.
  • Well said Darren. That would certainly be the ideal model in my mind as well.
  • I used to publish my articles, but now I wonder whether I should stop doing this because of the risk of a duplicate content penalty. Should I stop publishing my articles on article directories?
  • The real challenge seems to have less to do with the actual duplication of content and more to do with who is getting the credit for the original article. A good way to test this is to do a search for the exact title of your article and see which version comes up first. If the article directory version comes up first then it is considered the authoritative source for the article. In this case, you will probably want to retitle your articles and change the content so that they are not seen as duplicates.