If you’ve been online for any amount of time, you’ve most likely heard of Google’s Duplicate Content Penalty. It’s pretty famous.
It also doesn’t exist.
Or rather, it does, but not in the way many people think it does, which is why we include checking for duplicate content in our list of website audit basics.
What is Duplicate Content?
Duplicate content is, in the purest sense of the term, pages that contain the same information. Plagiarism is a perfect example and a prevalent issue online. The net is full of complaints from photographers, for example, who have had their images not just copied but passed off as someone else’s.
But duplicate content is often less a matter of purposeful theft and more a matter of improper management of site structure. In short, it’s often perpetrated by the website itself.
But What Do You Mean There’s No Duplicate Content Penalty?
There are actual penalties, in which a page or website is removed from Google’s search index. Duplicate content, except in extreme cases such as outright plagiarism, isn’t one of them.
So how do you get penalized for duplicate content?
In short, you’re cannibalizing your own results.
If you have two pages with the same information on them, you aren’t going to show up twice for the same search query. What’s going to happen is that Google will “decide” which version is the original and show that one.
No big deal, right? Wrong.
This means that the crawler has now spent two pages of your site’s crawl budget for only one result: a two-for-one sale in reverse. If you have several URL parameters and aren’t handling them properly, it could be even more pages for one result.
Ultimately, you want one targeted version of each page on your site to be available to search engines. Rather than block extra copies from search, however, it’s much better to alleviate the problem at its root.
What Causes Duplicate Content?
Besides scrapers (people who copy your site or pages and pass them off as their own), there are several causes of duplicate content. Some include:
Lack of proper redirects
There are at least three ways that your site can have duplicate content simply because redirects aren’t properly implemented or aren’t implemented at all. As we do our website audits, we look to make sure only one version of the site can be reached. The top three redirects that should be happening on your site but often aren’t:
www vs non-www:
Your site should be reachable at either www.example.com or example.com, not both. It doesn’t matter much which you choose, but pick one and redirect the other to it.
HTTP vs HTTPS:
A site should be served over either HTTP or HTTPS, not both. The “S” means the connection is secure. Most, if not all, websites should now be behind an SSL (Secure Sockets Layer) certificate. As far back as February 2018, Google announced that Chrome would start marking HTTP sites as “not secure,” which hurts credibility.
Trailing slash vs no trailing slash:
Especially a problem for WordPress sites, trailing slashes have become a real cause of duplicate content. A trailing slash is the “/” at the end of a URL. For example, a post’s URL could be either:

https://www.example.com/sample-post/
https://www.example.com/sample-post

Note the lack of the final “/” in the second example. It may be hard to believe that something so small could be any issue at all, but it is. Again, does it matter which one you choose? Not really but, also again, you need to pick one and stick to it.
If your site can be reached by www AND non-www, or HTTP AND HTTPS, or with the trailing slash AND without, you have a duplicate content problem. Make sure only one version, whatever it is, can be reached by setting up proper 301 redirects.
If you want your site to be https://www.example.com/, you want to make sure that these four URLs all redirect to it:
- http://example.com/
- http://www.example.com/
- https://example.com/
- https://www.example.com (no trailing slash)
If not, search engines are seeing multiple versions of your site.
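As a sketch, assuming an Apache server and https://www.example.com/ as the preferred version (the domain here is a placeholder), the 301 redirects could look like this in an .htaccess file:

```apache
RewriteEngine On

# Redirect HTTP and non-www requests to https://www.… in a single 301
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteCond %{HTTP_HOST} ^(?:www\.)?(.+)$ [NC]
RewriteRule ^ https://www.%1%{REQUEST_URI} [R=301,L]

# Add a trailing slash when the request isn't a real file
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(.*[^/])$ /$1/ [R=301,L]
```

Whether you enforce the trailing slash or strip it is a style choice; the point is to pick one form and 301 everything else to it.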
Lack of proper canonicals
The above issues should also be addressed in your canonicals. Canonicals are links within your code that tell the search engines what the actual address of a page should be. On most sites, implementing canonicals takes less than 10 minutes and can make a big difference in how well your site performs.
It clears up issues such as multiple versions of a page getting indexed because of URLs that carry lots of parameters.
?color=blue&filter=dress&sort=new anybody? There is very little reason for URLs containing parameters to ever appear in search. If you can think of one, we’re listening. Parameters usually consist of tracking codes, filtering information, item identification, session IDs, pagination and translation, among others.
While they can be helpful in terms of making sure the right information shows up on the page, they can also unnecessarily use up crawl budget. You want to make sure, to the best of your ability, that the search engine bots are only crawling the pages you want them to crawl, rather than wasting energy on multiple variations of a single page.
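Implementing a canonical is a single link tag in the page’s head. As an illustrative sketch (the URLs are placeholders), a filtered product listing could point back to its clean address:

```html
<!-- Served on https://www.example.com/dresses/?color=blue&sort=new -->
<head>
  <link rel="canonical" href="https://www.example.com/dresses/">
</head>
```

Search engines then consolidate the parameterized variations into that one canonical URL, so only the clean version competes in search.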
Being able to click through a blog index is super helpful, but pagination isn’t always a good thing. As an extreme example, we recently had a client who had a sliding bar with blog posts positioned at the bottom of a landing page. The problem was that, as you clicked through the 1…2…3 of the sliding bar, the landing page changed URLs to include /page/1… /page/2… /page/3.
Blog categories should be paginated. A single landing page shouldn’t. If you want to include posts in your landing pages, consider choosing two or three that are tightly relevant. It’s better for your URL structure, your internal linking, and your SEO, and it avoids duplicate content issues.
Additional duplicate content issues:
Sometimes, a website will still contain links left over from when the site was in development. When this happens, clicking a link takes both users and search engine crawlers to the development site. Nobody wants that.
You can use programs such as Copyscape to find out if someone is plagiarizing your content. For duplicate content on your own site, however, tools that crawl your site are usually more helpful. Semrush, Screaming Frog, Ahrefs and Moz are some of the better-known ones.
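You can also spot-check the redirect issues above yourself: enumerate the variants a site might accidentally serve and confirm that each one answers with a 301 to your preferred URL. A minimal Python sketch (the function name and the example.com domain are our own placeholders):

```python
from urllib.parse import urlsplit, urlunsplit

def duplicate_variants(preferred: str) -> list[str]:
    """Return the common duplicate variants of a preferred URL:
    http vs https, www vs non-www, trailing slash vs none.
    Each variant should 301-redirect to the preferred version."""
    scheme, host, path, query, fragment = urlsplit(preferred)
    bare_host = host[4:] if host.startswith("www.") else host
    variants = []
    for s in ("http", "https"):
        for h in (bare_host, "www." + bare_host):
            # With and without the trailing slash
            for p in {path.rstrip("/"), path.rstrip("/") + "/"}:
                url = urlunsplit((s, h, p, query, fragment))
                if url != preferred:
                    variants.append(url)
    return variants

# Each of these should redirect to https://www.example.com/
for url in duplicate_variants("https://www.example.com/"):
    print(url)
```

Fetch each variant and check the response: anything that returns a 200 instead of a 301 to the preferred URL is a duplicate-content leak.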
All the above duplicate content issues can be fixed with proper due diligence. Website audits can help you find the issues that keep your site from providing an optimal experience for users, and an optimal crawl for search engines. When you’re not sure what your site looks like, contact us about our website optimization services. We’ll find the strengths and weaknesses that can help your site be the experience you want it to be.