Duplicate content is one of the topics where the SEO community has built a body of folklore that does not match what Google actually does. The folklore says Google penalises duplicate content. It does not. The folklore says small amounts of duplication are dangerous. They are usually invisible. The folklore says you can be hit by a duplicate content penalty without warning. You cannot.
What is true is more nuanced. Google does treat duplicate content as a real signal, and the consequences shape rankings on most sites. But the mechanism is selection, not punishment. Understanding that distinction changes how to think about the problem and the fixes.
What duplicate content actually is
Duplicate content is any case where the same or substantially the same words appear at more than one URL. The most common forms are technical, often accidental, and usually invisible until someone looks for them.
- The same page accessible at both http and https versions.
- The same page accessible at both www and non-www.
- The same page with and without a trailing slash.
- The same page with various URL parameters appended for tracking, filtering, or session management.
- Faceted-navigation pages where filter combinations produce near-identical results.
- Pagination patterns where page two and page three of a category show mostly the same content as page one.
- Tag and archive pages that show excerpts of articles already published as their own pages.
- Syndicated articles republished on partner sites.
- Product descriptions copied verbatim across multiple retailer pages.
Near-duplication, where pages share roughly 70 to 95 per cent of their text, is treated in much the same way as exact duplication for SEO purposes. So is cross-language duplication if the translation is mostly automated and not editorially distinct.
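To put a number on how similar two pages are, one common approach is to compare overlapping word windows (shingles) and take the Jaccard similarity of the two sets. The sketch below is an illustration only: the function names and the three-word window are assumptions, and Google has not published how it measures near-duplication.

```python
import re
from typing import Set, Tuple

def shingles(text: str, k: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of overlapping k-word windows found in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Text shorter than one window: treat it as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: str, b: str, k: int = 3) -> float:
    """Share of shingles the two texts have in common, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are trivially identical
    return len(sa & sb) / len(sa | sb)

if __name__ == "__main__":
    page_a = "blue cotton shirt in size small with free delivery"
    page_b = "blue cotton shirt in size large with free delivery"
    print(round(jaccard(page_a, page_b), 2))
```

Shingle overlap is stricter than a word-by-word comparison, so a single changed word lowers the score more than intuition suggests. Any threshold should be calibrated against pages already known to be duplicates.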
The three costs of duplicate content
Crawl waste. Every duplicate URL consumes a fetch from your crawl budget. On a small site this rarely matters. On a site with thousands of URLs and substantial parameter duplication, it can become the constraint that decides which content gets crawled at all. The wider view of how crawl budget gets allocated sits in crawl budget for SEO.
Diluted link equity. When external sites link to a page, they sometimes link to one version and sometimes link to another. If a hundred backlinks point to five different URL variants of the same article, no single URL has the full hundred backlinks behind it. Each version has a fraction, and each fraction is weaker than the whole. The canonical signal consolidates this; the absence of one leaves the equity scattered.
The wrong URL ranks. The most operationally painful of the three. Google picks a version to rank based on what it judges to be the most authoritative, which is not always the version you wanted. The wrong-URL outcome breaks tracking (because conversions happen on the URL the user landed on, not the one you watch), breaks reporting (because the canonical URL in your analytics shows zero traffic), and breaks user experience (because the URL ranked may be the one with tracking parameters appended, the unsecured http version, or the one that has since been redirected).
None of these is a penalty. None of them shows in Search Console as a manual action. All of them quietly cost rankings and reporting until somebody runs the diagnostic.
Where duplicate content comes from on most sites
The same five or six patterns produce most of the duplicate content on most sites. Knowing them speeds up diagnosis.
URL parameters. Tracking parameters like ?utm_source=email, session IDs, sort orders, and pagination markers all create new URLs that contain mostly the same content. Most CMS platforms add parameters by default for filters and faceted navigation, and the result is often hundreds or thousands of variant URLs per category page.
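A quick way to see how many parameter variants collapse into one page is to strip the parameters that do not change content and compare what is left. The sketch below uses Python's standard urllib.parse; the IGNORED_PARAMS list is an assumption and needs adjusting to the parameters a given site actually uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of parameters that never change what the page shows.
IGNORED_PARAMS = {
    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
    "sessionid", "sort",
}

def strip_params(url: str) -> str:
    """Drop ignored parameters and the fragment; sort the remaining parameters."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    kept.sort()  # so ?a=1&b=2 and ?b=2&a=1 compare as equal
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

if __name__ == "__main__":
    print(strip_params("https://example.com/shoes?utm_source=email&colour=red"))
```

Running every URL from a crawl export through a function like this and counting the distinct results gives a rough measure of how much of the URL universe is parameter duplication.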
Protocol and subdomain inconsistency. If both http://example.com and https://example.com resolve and return the same page, or if both www.example.com and example.com do, you have at least one set of duplicate URLs for every page on your site. The 301 redirect rule that consolidates these is usually a one-time fix at the server level, and it should send each variant to the preferred URL in a single hop rather than through a chain of redirects.
Trailing slash inconsistency. The URLs example.com/page and example.com/page/ are different URLs from Google's perspective. Most modern servers handle this with a redirect, but plenty of older configurations do not.
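The protocol, subdomain, and trailing-slash variants can all be mapped to one preferred form, which is also what the server-level redirect should do. The sketch below assumes the preferred form is https, non-www, and no trailing slash; that choice and the function names are illustrative, not a recommendation for every site.

```python
from collections import Counter
from typing import List
from urllib.parse import urlsplit, urlunsplit

def preferred_url(url: str) -> str:
    """Map a URL variant to the assumed preferred form."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    # Remove trailing slashes, but keep "/" for the home page.
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(("https", host, path, parts.query, ""))

def consolidate(backlinks: List[str]) -> Counter:
    """Count backlinks per preferred URL instead of per variant."""
    return Counter(preferred_url(url) for url in backlinks)

if __name__ == "__main__":
    links = [
        "http://example.com/page",
        "https://www.example.com/page/",
        "https://example.com/page",
    ]
    print(consolidate(links))
```

Counting backlinks per preferred URL rather than per variant shows how much link equity is currently split across duplicates.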
Print-friendly and AMP versions. Older sites that maintained separate print stylesheets or AMP versions still occasionally serve those as distinct URLs. Each one is a duplicate of the original.
Faceted navigation. Product category pages that allow users to filter by size, colour, brand, and price typically generate a unique URL for every combination of filter choices. Eight facets with three options each, with one option chosen per facet, produce 3^8 = 6,561 URL variations per category before adding sort orders.
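The arithmetic behind that figure is simple to reproduce. The sketch below is a hypothetical helper, not part of any tool; it also covers the common case where each facet can be left unset, which makes the count grow considerably faster.

```python
def facet_url_count(facets: int, options_per_facet: int,
                    optional: bool = False, sort_orders: int = 1) -> int:
    """Number of distinct filter URLs a faceted category page can generate.

    optional=True adds a "no filter chosen" state to every facet.
    sort_orders multiplies the total by the number of available sort orders.
    """
    choices = options_per_facet + 1 if optional else options_per_facet
    return (choices ** facets) * sort_orders

if __name__ == "__main__":
    print(facet_url_count(8, 3))                 # 6561: one option chosen per facet
    print(facet_url_count(8, 3, optional=True))  # 65536: each facet may be unset
```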
Pagination. Category page two, page three, and page four show different sets of products but with mostly the same template, header, sidebar, and footer content. Google often consolidates these but not always; clean pagination patterns make the consolidation more reliable.
The diagnostic flow when these patterns produce indexation problems is in how to fix indexing problems in Google Search Console, which covers the Search Console reports that surface duplicate content issues.
Cross-domain duplicate content: syndication and scraping
Duplication across sites you do not control needs a different approach. Three common cases.
The first is legitimate syndication. You publish an article, a partner republishes it. Both versions exist with permission. The version with more authority signals tends to win Google's canonical selection. If the partner has more authority than you, your original may get demoted in favour of theirs. The fix is for the syndicating partner to include a rel=canonical tag pointing back to your URL. If they refuse, an attribution link with clear source language is the next-best signal.
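Whether a partner has actually added the canonical tag can be checked from the HTML of their copy. The sketch below uses Python's standard html.parser and assumes the page source has already been fetched; the class and function names are hypothetical, and the comparison is an exact string match, so the URL has to be in the same form the partner uses.

```python
from html.parser import HTMLParser
from typing import Optional

class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.canonical = attributes["href"]

def canonical_of(html: str) -> Optional[str]:
    """Return the canonical URL declared in the HTML, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical

def credits_original(partner_html: str, original_url: str) -> bool:
    """True if the partner's copy declares the original URL as canonical."""
    return canonical_of(partner_html) == original_url
```

A result of False covers two different situations: the tag is missing, or it points somewhere other than the original. The first calls for a request to the partner, the second for a correction.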
The second is scraped content. A scraper copies your article verbatim without permission. The right response depends on the scale and authority of the scraping site. For low-authority sites, ignoring is usually best; Google handles low-authority duplicates competently. For high-authority sites that scrape, a DMCA takedown notice through Google's removal request system is the formal channel.
The third is product description duplication on e-commerce. Manufacturers provide standard product descriptions that hundreds of retailers paste into their product pages. From Google's perspective, every retailer is showing the same page. Retailers that win in this scenario invest in original product descriptions, original photography, and original content additions (sizing guides, comparison tables, customer questions). The investment is rarely cheap, but it is what differentiates the e-commerce sites that rank from the ones that do not.
The fix hierarchy, in priority order
The fix order matters. Each option signals something different to Google and has different trade-offs.
- Choose the canonical URL. The decision precedes any technical implementation. For each page, which URL do you want to rank? Resolve protocol, subdomain, slash, and parameter questions at the strategic level first.
- Use a rel=canonical tag. The least disruptive option. Add a <link rel="canonical" href="..."> tag in the HTML head of every duplicate URL pointing to the chosen canonical. Both URLs remain accessible to users; Google treats the canonical as the version to index. The canonical signal is a strong hint but not binding; Google can ignore it if other signals suggest a different URL is actually canonical. The mechanics are unpacked in robots.txt and canonical tags.
- Use a 301 redirect. When the duplicate URL has no reason to exist as a separate destination. The 301 permanently consolidates the duplicate into the canonical, transferring link equity in the process. Use this for protocol fixes (http to https), subdomain fixes (non-www to www or vice versa), and trailing-slash consolidation.
- Use a noindex meta tag. When the page should exist for users but should never appear in search results. Useful for thank-you pages, internal search results, and filtered category pages that you want crawlable but not indexable. Noindex prevents indexing but still costs a crawl budget fetch.
- Handle parameters at the source. Search Console used to offer a URL Parameters tool for telling Google how to treat each parameter (ignore, track, narrow, sort), which was more efficient than canonicalising every variant. Google retired that tool in 2022, so parameter handling at scale now depends on canonical tags, internal links that consistently use the clean URL, and, where crawling itself is the problem, robots.txt rules.
- Use robots.txt disallow. The most aggressive option. The URL is not crawled at all. Use only when the duplicate URL should not be indexed AND should not be crawled, because robots.txt blocks prevent Google from even seeing the canonical or noindex tag if one exists. A disallowed URL can also still appear in results, without a description, if other pages link to it. Mistakes here can be painful.
The hierarchy reflects an underlying principle: signal as gently as the situation allows. A canonical tag handles most cases. A redirect handles the cases where the duplicate URL should not exist. The harder controls (noindex, parameter handling, robots.txt) are for the cases that the first two cannot solve.
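The most common conflict in this hierarchy is a robots.txt rule that blocks a URL whose canonical or noindex tag Google is meant to read. The sketch below checks for that with Python's standard urllib.robotparser. The function name and sample rules are assumptions, and the standard-library parser matches plain path prefixes only, so rules that rely on Google's wildcard syntax need a different checker.

```python
from typing import List
from urllib.robotparser import RobotFileParser

def blocked_urls(robots_txt: str, urls: List[str],
                 agent: str = "Googlebot") -> List[str]:
    """Return the URLs that the given robots.txt text disallows for the agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]

if __name__ == "__main__":
    # Hypothetical rules and URLs for illustration.
    rules = "User-agent: *\nDisallow: /print/\n"
    tagged = [
        "https://example.com/print/article",
        "https://example.com/article",
    ]
    # Any URL listed here carries a tag that Google cannot crawl to read.
    print(blocked_urls(rules, tagged))
```

Feeding it the list of URLs that carry a canonical or noindex tag shows which of those tags are currently hidden from the crawler.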
Myths worth retiring
A few duplicate-content beliefs that get repeated even though Google has publicly contradicted them.
- "Google penalises duplicate content." No. Google selects from duplicates; it does not penalise the site that hosts them.
- "You need to rewrite every product description." Only if the existing duplication is costing you ranking. Most small retailers do not need to invest here; some large ones absolutely do.
- "Tag and archive pages must be noindexed." Depends. Tag pages with thin content (a list of titles only) are usually best noindexed. Tag pages curated with genuine commentary can rank usefully on their own.
- "Internal duplication is worse than external duplication." The reverse is closer to true. Internal duplication is fixable from the inside; external duplication requires another site to cooperate.
The honest version of duplicate content
The cost of duplicate content is real, but it is not the cost most SEO folklore describes. Google does not punish sites for duplication. It picks a version to rank. The job of a clean SEO programme is to make that choice explicit, so the URL that ranks is the one you wanted to rank. The fixes are mostly one-time decisions at the architecture level: which protocol, which subdomain, which canonical, which parameter rules. Once those are set, the maintenance burden is small.
Our Bangkok SEO agency handles duplicate content audits as part of every technical engagement, because the issues compound silently and rarely surface in standard reporting. Our technical SEO services in Thailand include the URL-universe audit that surfaces every duplication pattern at scale. An SEO specialist in Thailand can walk through the Search Console diagnostics and the fix hierarchy in less time than reading this post took.
Common questions
Does Google penalise duplicate content?
No. Google has stated publicly more than once that there is no duplicate content penalty in the algorithm, and there never has been. What does happen is that when Google finds the same content at multiple URLs, it picks one version to index and rank, then largely ignores the others. The cost is real but it works through this selection process rather than through punishment. If Google picks the version you wanted to rank, no harm is done. If Google picks a different version, your backlinks pointing to the intended version get partly wasted, your tracking points to a URL nobody is visiting, and the wrong version may show in search results.
What counts as duplicate content?
Duplicate content is any case where the same or substantially the same words appear at more than one URL. The most common forms are technical: the same page accessible at both http and https versions, both www and non-www, with and without a trailing slash, and with various URL parameters appended for tracking or filtering. Less obvious forms include syndicated articles, product descriptions copied across many retailer pages, faceted-navigation pages where filter combinations produce near-identical content, and tag or archive pages that show excerpts of articles already published elsewhere on the site.
How do I fix duplicate content?
The fix depends on the type of duplication and what you want to achieve, but the priority order is consistent. First, choose which URL should be the canonical version. Then signal that choice to Google using the most appropriate of: a rel=canonical tag, a 301 redirect, a noindex tag, consistent parameter handling, or a robots.txt disallow. Each fix has trade-offs: canonicals are non-binding suggestions, 301s are permanent, and noindex tags are obeyed but still cost crawl budget.
Is syndicated content bad for SEO?
Not inherently, but it has to be handled correctly. If you publish an article on your site and a partner republishes it, both versions are duplicate content from Google's perspective. The version with more authority signals usually wins the canonical selection, which may or may not be your version. The fix is for the syndicating partner to include a rel=canonical tag pointing back to your original. If the partner refuses, you should at least negotiate a clear attribution link.