Duplicate Content: The Impact of Canonical URLs

Being a web developer I am trying to become savvier when it comes to factoring additional SEO practices, which is generally considered (in my view) compulsory. 

Ever since Google updated its Search Console (formally known as Webmaster Tools), it has opened my eyes to how my site is performing in greater detail, especially the pages Google deems as links not worthy for indexing. I started becoming more aware of this last August, when I wrote a post about attempting to reduce the number "Crawled - Currently not indexed" pages of my site. Through trial and error managed to find a way to reduce the excluded number of page links.

The area I have now become fixated on is the sheer number of pages being classed as "Duplicate without user-selected canonical". Google describes these pages as:

This page has duplicates, none of which is marked canonical. We think this page is not the canonical one. You should explicitly mark the canonical for this page. Inspecting this URL should show the Google-selected canonical URL.

In simplistic terms, Google has detected there are pages that can be accessed by different URL's with either same or similar content. In my case, this is the result of many years of unintentional neglect whilst migrating my site through different platforms and URL structures during the infancy of my online presence. 

Google Search Console has marked around 240 links as duplicates due to the following two reasons:

  1. Pages can be accessed with or without a ".aspx" extension.
  2. Paginated content.

I was surprised to see paginated content was classed as duplicate content, as I was always under the impression that this would never be the case. After all, the listed content is different and I have ensured that the page titles are different for when content is filtered by either category or tag. However, if a site consists of duplicate or similar content, it is considered a negative in the eyes of a search engine. 

Two weeks ago I added canonical tagging across my site, as I was intrigued to see if there would be any considerable change towards how Google crawls my site. Would it make my site easier to crawl and aid Google in understanding the page structure?

Surprising Outcome

I think I was quite naive about how my Search Console Coverage statistics would shift post cononicalisation. I was just expecting the number of pages classed as "Duplicate without user-selected canonical" to decrease, which was the case. I wasn't expecting anything more. On further investigation, it was interesting to see an overall positive change across all other coverage areas.

Here's the full breakdown:

  • Duplicate without user-selected canonical: Reduced by 10 pages
  • Crawled - Currently not indexed: Reduced by 65 pages
  • Crawl anomaly: Reduced by 20 pages
  • Valid : Increased by 60 pages

The change in figures may not look that impressive, but we have to remember this report is based only on two weeks after implementing canonical tags. All positives so far and I'm expecting to see further improvements over the coming weeks. 

Conclusion

Canonical markup can often be overlooked, both in its implementation and importance when it comes to SEO. After all, I still see sites that don't use them as the emphasis is placed on other areas that require more effort to ensure it meets Google's search criteria, such as building for mobile, structured data and performance. So it's understandable why canonical tags could be missed.

If you are in a similar position to me, where you are adding canonical markup to an existing site, it's really important to spend the time to set the original source page URL correctly the first time as the incorrect implementation can lead to issues.

Even though my Search Console stats have improved, the jury's still out to whether this translates to better site visibility across search engines. But anything that helps search engines and visitors understand your content source can only be beneficial.

Reducing The Number of 'Crawled - Currently not indexed' Pages

Every few weeks, I check over the health of my site through Google Search Console (aka Webmaster Tools) and Analytics to see how Google is indexing my site and look into potential issues that could affect the click-through rate.

Over the years the content of my site has grown steadily and as it stands it consists of 250 published blog posts. When you take into consideration other potential pages Google indexes - consisting of filter URL's based on grouping posts by tag or category, the number of links that my site consists is increased considerably. It's to the discretion of Google's search algorithm to whether it includes these links for indexing.

Last month, I decided to scrutinise the Search Console Index Coverage report in great detail just to see if there are any improvements I can make to alleviate some minor issues. What I wasn't expecting to see is the large volume of links marked as "Crawled - Currently not indexed".

Crawled Currently Not Indexed - 225 Pages

Wow! 225 affected pages! What does "Crawled - Currently not indexed" mean? According to Google:

The page was crawled by Google, but not indexed. It may or may not be indexed in the future; no need to resubmit this URL for crawling.

Pretty self-explanatory but not much guidance on the process on how to lessen the number of links that aren't indexed. From my experience, the best place to start is to look at the list of links that are being excluded and to form a judgement based on the page content of these links. Unfortunately, there isn't an exact science. It's a process of trial and error.

Let's take a look at the links from my own 225 excluded pages:

Crawled Currently Not Indexed - Non Indexed Links

On initial look, I could see that the majority of the URL's consisted of links where users can filter posts by either category or tag. I could see nothing content-wise when inspecting these pages for a conclusive reason for index exclusion. However, what I did notice is that these links were automatically found by Google when the site gets spidered. The sitemap I submitted in the Search Console only list out blog posts and content pages.

This led me to believe a possible solution would be to create a separate sitemap that consisted purely of links for these categories and tags. I called it metasitemap.xml. Whenever I added a post, the sitemap's "lastmod" date would get updated, just like the pages listed in the default sitemap.

I created and submitted this new sitemap around mid-July and it wasn't until four days ago the improvement was reported from within the Search Console. The number of non-indexed pages was reduced to 58. That's a 74% reduction!

Crawled Currently Not Indexed - 58 Pages

Conclusion

As I stated above, there isn't an exact science for reducing the number of non-indexed pages as every site is different. Supplementing my site with an additional sitemap just happened to alleviate my issue. But that is not to say copying this approach won't help you. Just ensure you look into the list of excluded links for any patterns.

I still have some work to do and the next thing on my list is to implement canonical tags in all my pages since I have become aware I have duplicate content on different URL's - remnants to when I moved blogging platform.

If anyone has any other suggestions or solutions that worked for them, please leave a comment.

Dynamic Robots.txt file for ASP.NET MVC Sites

Changing the contents of a robots.txt file when a site is moved from staging to a live environment is quite a manual and somewhat cumbersome process. I sometimes have the fear of forgetting to replace the "Disallow: /" line with correct indexable entries required for the website.

To give me one less thing to remember during my pre-live deployments, all my current and upcoming ASP.NET MVC sites will use a dynamic robots.txt file containing different entries depending on whether a site is in stage or live. In order to do this, we need to let the ASP.NET MVC application serve up the robots.txt file. This can be done by the following:

  • Create a new controller called "SEOController"
  • Add a new FileContentResult action called "Robots".
  • Add a new route.
  • Modify web.config.

SEOController

I have created a new controller that renders our "Robots" action as a plain text file. As you can see, my controller is not a type of "ActionResult" but a "FileContentResult". The great thing about "FileContentResult" is that it allows us to return bytes from the controller.

In this example, I am converting bytes from a string using Encoding.UTF8.GetBytes() method. Ideal for what we need to generate a robots.txt file.

[Route("robots.txt")]
[Route("Robots.txt")]
public FileContentResult Robots()
{
    StringBuilder robotsEntries = new StringBuilder();
    robotsEntries.AppendLine("User-agent: *");

    //If the website is in debug mode, then set the robots.txt file to not index the site.
    if (System.Web.HttpContext.Current.IsDebuggingEnabled)
    {
        robotsEntries.AppendLine("Disallow: /");
    }
    else
    {
        robotsEntries.AppendLine("Disallow: /Error");
        robotsEntries.AppendLine("Disallow: /resources");
        robotsEntries.AppendLine("Sitemap: http://www.surinderbhomra.com/sitemap.xml");
    }

    return File(Encoding.UTF8.GetBytes(robotsEntries.ToString()), "text/plain");
}

RouteConfig

Since I add my routing at controller level, I add the "MapMvcAttributeRoutes" method to the RouteConfig.cs file. But if you prefer to add your routes directly here, then this method can be removed.

public class RouteConfig
{
    public static void RegisterRoutes(RouteCollection routes)
    {
        routes.IgnoreRoute("{resource}.axd/{*pathInfo}");

        routes.MapMvcAttributeRoutes(); //Add this line!

        routes.MapRoute(
            name: "Default",
            url: "{controller}/{action}/{id}",
            defaults: new { controller = "Home", action = "Index", id = UrlParameter.Optional }
        );
    }
}

Web.config

Add a "runAllManagedModulesForAllRequests" attribute to the modules tag in the web.config file to allow our robot.txt text file to be rendered by ASP.NET.

<system.webserver>
  <modules runallmanagedmodulesforallrequests="true"></modules>
</system.webserver>