404 VS 410—THE TECHNICAL SEO EXPERIMENT


For years, Google spokespeople have maintained that there are only small differences in the way 404 and 410 URLs are treated. Despite this, it has never been entirely clear how the two are actually handled.

In fact, Google’s own John Mueller stated that using either 404 or 410 response codes for deleted pages is fine, but 410 response codes can be seen as “more permanent” and work faster than 404 ones. 

This means that Google could remove a 410 page from the index sooner than a 404 page. 

With limited data available on this topic, the SEO geeks at Reboot wanted to put the two response codes to the test to see exactly how Google handles them. 

 

Hypertext Transfer Protocol (HTTP) & Response Status Codes

Before we dive into our latest SEO experiment, a quick overview of HTTP and response status codes might be useful.

Tim Berners-Lee, the British computer scientist who is regarded as the inventor of the World Wide Web, introduced HTTP as a web standard in 1991. These days, virtually all web traffic uses HTTP. 

Although it has evolved since the early 90s, HTTP still forms a core part of the foundations of the World Wide Web. It not only helps us access and experience websites, but is also used by search engines when building their indexes, and by developers for a wide range of projects and applications. 

 

Response Status Codes

An HTTP response status code indicates to a user whether a request was completed successfully, and if not, gives information as to why it wasn’t. 

There are a whole host of status codes we could have explored, but we chose to focus on 404 and 410 for this experiment. 
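
To see what this looks like in practice, here is a minimal Python sketch using the requests library. It fetches a page and prints the status code the server sends back; the URL is a hypothetical placeholder rather than one of our test pages.

```python
import requests

# Fetch a page and inspect the HTTP status code the server returns.
# The URL below is a hypothetical placeholder used purely for illustration.
response = requests.get("https://example.com/some-deleted-page/", allow_redirects=False)

print(response.status_code)   # e.g. 200, 404 or 410
print(response.reason)        # e.g. "OK", "Not Found" or "Gone"

if response.status_code == 404:
    print("The server could not find the requested page.")
elif response.status_code == 410:
    print("The page has been permanently removed.")
```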

 

404 Response Code

 

You may come across the 404 response code fairly often when browsing online, as it indicates that the server wasn't able to locate and/or return the requested page.

Although the server cannot locate the requested URL and therefore returns a 404 page (as pictured), a good SEO company will encourage you to take advantage of this page by linking to your important pages, so you don't lose that user entirely. 

 

410 Response Code

 

The 410 response code is slightly different—this is used when the requested content has been permanently removed or deleted from the server with no forwarding or redirect address.
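
From a developer's point of view, the difference is simply a matter of which code the server is told to send for a removed URL. Below is a minimal Flask sketch illustrating the idea; the routes and page lists are hypothetical and not taken from our test sites.

```python
from flask import Flask, abort

app = Flask(__name__)

# Hypothetical content store: live pages plus pages we have deliberately deleted.
LIVE_PAGES = {"about": "About us", "contact": "Get in touch"}
DELETED_PAGES = {"old-promo", "retired-service"}

@app.route("/<slug>")
def page(slug):
    if slug in LIVE_PAGES:
        return LIVE_PAGES[slug]   # 200 OK: the page exists
    if slug in DELETED_PAGES:
        abort(410)                # 410 Gone: removed on purpose, no redirect
    abort(404)                    # 404 Not Found: the server can't locate it

if __name__ == "__main__":
    app.run()
```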

 

Why 404s and 410s Matter

When building and maintaining a website, you will eventually have to remove obsolete pages. When you delete them, you need to return an appropriate response code to inform users and search engines that the page no longer exists.

Assuming these unwanted pages aren’t being redirected to another relevant URL, developers will choose to return either a 404 or 410 response code when a page is intentionally deleted. 

But, which one is more effective? If we go by John Mueller’s comments, a 410 response could see the deleted content being removed from the index faster, and being recrawled less frequently. 

This would be particularly handy for hacked websites to remove spam pages and larger sites that have tight crawl budgets. 

A page returning a 404 response code, however, may suggest the error is only temporary, so search engines will keep recrawling the page in order to update their index once the mistake has been fixed.

 

Hypothesis

With this in mind, we proposed that deleted URLs that returned a 410 response code would be removed from Google’s index quicker and be recrawled less frequently than those pages returning a 404 response code. 

 

Methodology

In order to effectively test this hypothesis, we needed to observe and measure how often the Googlebot returned to crawl a sample of 410 URLs compared to a sample of 404 ones—and how quickly the URLs in each sample were removed from the index. 

 

Experiment Methodology

1. Identify two test websites with active, indexed URLs.

2. Access log files and begin archiving.

3. Use the Google Search Console API for hourly checks to detect if (and when) the targeted sample URLs remain indexed, and when Google last crawled them (see the sketch after this list).

4. Crawl the test websites and identify the indexed URLs without external links.

5. Within that sample, select URLs either without internal links or with consistent internal links (the same number of internal links from the exact same pages).

6. Set half of the indexed URL sample to 404, and the other half to 410.

7. Observe how frequently Google recrawls the various URLs.

8. Observe how quickly the URLs return a Coverage State from the GSC API that suggests they are no longer indexed.

9. Compare data collected from the 404 sample URLs and the 410 ones.

*Disclaimer: During the experiment, it became clear we could not accurately track indexing, so this element was removed from the study. 
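
As a rough illustration of the hourly checks in step 3, the sketch below polls Google's URL Inspection API (part of the Search Console API) for a handful of sample URLs and prints the coverage state and last crawl time it reports. The service account file, property URL and sample list are placeholders, and the field names follow the public API documentation rather than the exact script we ran.

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

# Placeholder credentials and property; swap in your own values.
SCOPES = ["https://www.googleapis.com/auth/webmasters.readonly"]
creds = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES
)
service = build("searchconsole", "v1", credentials=creds)

SITE = "https://example.com/"
SAMPLE_URLS = [
    "https://example.com/test-page-1/",
    "https://example.com/test-page-2/",
]

for url in SAMPLE_URLS:
    body = {"inspectionUrl": url, "siteUrl": SITE}
    result = service.urlInspection().index().inspect(body=body).execute()
    status = result["inspectionResult"]["indexStatusResult"]
    print(url, status.get("coverageState"), status.get("lastCrawlTime"))
```

In practice, a scheduler such as cron would run a script like this once an hour and append each result to a log for later comparison.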

 

Of course, there are always outside factors that can influence the results. Below is a rundown of the most important ones we identified and how we sought to mitigate them.

 

Minimising Outside Factors

External Links
Google finds new (and old) content to crawl by following links found on web pages, and it is widely accepted that websites with a greater number of external links will be crawled more frequently. 

In order to minimise the influence of external links on our results, we chose URLs in our samples that had—at the time of checking—no external links pointing to them. 
 

Internal Links
Search engines also use internal linking to identify content to crawl—or recrawl. And similarly to external links, URLs with more internal links are likely to be found and crawled more frequently as a result. 

We once again selected and monitored URLs that had no internal links at all—or the exact same number of internal links. 

To generate a list of URLs that met these criteria, we used the Screaming Frog SEO Spider tool. We were then able to run comparisons across like-for-like pages. 
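
As a rough sketch of how such a list might be assembled, the snippet below filters a Screaming Frog internal HTML export for indexable URLs with zero inlinks, and groups the rest by inlink count so like-for-like pages can be compared. The file name and column names are assumptions based on a typical export, so check them against your own crawl.

```python
import pandas as pd

# Hypothetical Screaming Frog export; column names may differ between versions.
crawl = pd.read_csv("internal_html.csv")

# Indexable HTML pages with no internal links pointing at them.
no_inlinks = crawl[(crawl["Indexability"] == "Indexable") & (crawl["Inlinks"] == 0)]

# Group the remaining pages by inlink count so comparisons can be run
# across pages with identical internal linking.
by_inlink_count = crawl.groupby("Inlinks")["Address"].apply(list)

print(no_inlinks["Address"].tolist())
print(by_inlink_count.head())
```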
 

Experiment Site Changes
It was very important that no changes occurred to the two test sites used in the experiment, as this could easily impact the Googlebot crawling them. 

No one outside of the SEO and web development teams at Reboot knew about the experiment sites, so we could be confident that no other projects or changes would touch them during the experiment. 
 

Speed & Usability
Initially, we had planned to use each test site for one of the response codes. However, we realised this approach would introduce factors relating to speed and usability that would impact their crawlability. Instead, we set 50% of sample URLs to return 404s and 50% to return 410s on each domain.

In theory, this would mean that any positive or negative UX or speed issues impacting either domain would affect the URLs on each site in the same way. 
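
The split itself can be as simple as shuffling each domain's sample and cutting it in half. Here is a short sketch of that idea; the URLs are placeholders rather than our actual test pages.

```python
import random

# Hypothetical sample of indexed URLs from one of the test domains.
sample_urls = [f"https://example.com/test-page-{i}/" for i in range(1, 21)]

random.seed(42)                       # fixed seed so the assignment is reproducible
random.shuffle(sample_urls)

midpoint = len(sample_urls) // 2
return_404 = sample_urls[:midpoint]   # these URLs will be configured to return 404
return_410 = sample_urls[midpoint:]   # these URLs will be configured to return 410
```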
 

Crawl Rates 
It was important to check if there were any differences in the crawl rates of our sample URLs before updating them to return either response code. Doing this allowed us to spot early on if any of the URLs we were planning to set to 410 were already being crawled less frequently. 

Our checks not only showed that this wasn't the case, but that the 410 sample URLs were actually being crawled more frequently than the 404 ones. 

 

Key Obstacles

As this is the first experiment of this kind that we’ve run, we inevitably ran into some obstacles. To ensure validity and accuracy, we’ve detailed these hurdles below and explained how we tackled them. 

 

Log Files
Our plan was to collect and save the log files of both sites throughout the experiment, allowing us to cross-compare any data from the Google Search Console API with our own server data. 

Unfortunately, a configuration error meant the log files were only being saved and stored for a 14-day rolling period. This meant that after two weeks of the experiment we no longer had accurate data for the first day of the test. 

Once we’d fixed the configuration to save all weblog events, we compared the last crawled time in the Google Search Console API data with the weblog data to confirm they matched—which they did. Following this, we carried out random spot checks to ensure all data matched correctly. 
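
For anyone wanting to replicate the cross-check, here is a rough sketch that counts Googlebot requests per URL in an access log using the common combined format, and records the most recent hit for each path. The log path and format are assumptions, and a stricter check would also verify requests against Google's published IP ranges rather than trusting the user agent string alone.

```python
import re
from collections import Counter

# Hypothetical access log in Apache/Nginx combined format.
LOG_PATH = "access.log"
LINE_RE = re.compile(
    r'\S+ \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

hits = Counter()
last_seen = {}

with open(LOG_PATH) as log:
    for line in log:
        match = LINE_RE.match(line)
        if not match or "Googlebot" not in match["agent"]:
            continue
        hits[match["path"]] += 1
        last_seen[match["path"]] = match["time"]

# Crawl count and most recent Googlebot hit per URL, busiest first.
for path, count in hits.most_common():
    print(path, count, last_seen[path])
```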

 

Google Search Console ‘Coverage State’
In order to measure and observe how frequently the Googlebot crawled the sample URLs (and how quickly they were removed from the index), we were pulling the 'Coverage State' of each URL from the GSC API.

We had anticipated that these URLs, once removed from the index, would be labelled ‘crawled, currently not indexed’ or similar. 

However, a few days into the experiment we discovered the removed URLs were showing ‘Not Found (404)’ regardless of their response codes—and didn’t tell us whether they were still indexed or not. 

A manual check with a site command showed that the URLs marked as a 404 error were still in the index—meaning the API wasn’t going to tell us if they’d been removed. 

With this in mind, we decided to focus the entirety of the experiment on the crawl frequency instead. 

 

Results

We observed the test sites for more than three months, by which point it was apparent we had more than enough data on how frequently the URLs were being crawled to draw insights. 

We passed this over to our data team lead, Ali Hassan, who confirmed that 404 URLs are indeed crawled more frequently than 410s: 

"An analysis of the Google Search Console API data, looking at our sample of 119 test web pages, shows that 404s are, on average, crawled 49.6% more often than 410s."

To be certain, we then broke down the sample results by internal link counts. 

Ali added: 

"This trend is also consistent amongst strata of our sample grouped on the basis of internal backlink counts. A T-test confirms the statistical significance of our overall results to within 95% confidence.

While both analyses confirm our initial hypothesis, the latter lacks sufficient sample sizes for conclusive results to be ascertained for each stratum. The overall trend, however, is indicative.”
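
For readers who want to run a similar comparison on their own data, here is a minimal sketch of the kind of test involved, using scipy's independent samples t-test on two made-up lists of per-URL crawl counts. The numbers are illustrative only and are not our experiment data.

```python
from scipy import stats

# Illustrative per-URL crawl counts over the observation window (not real data).
crawls_404 = [12, 15, 9, 14, 11, 13, 16, 10, 12, 14]
crawls_410 = [8, 7, 10, 6, 9, 7, 8, 11, 6, 9]

mean_404 = sum(crawls_404) / len(crawls_404)
mean_410 = sum(crawls_410) / len(crawls_410)
uplift = (mean_404 - mean_410) / mean_410 * 100

# Welch's t-test, which does not assume equal variances between the groups.
t_stat, p_value = stats.ttest_ind(crawls_404, crawls_410, equal_var=False)

print(f"404s crawled {uplift:.1f}% more often than 410s on average")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")   # p < 0.05 => significant at 95% confidence
```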

Director of SEO at Shopify, Kevin Indig, took an early look at the write-up and results and shared his thoughts:

"410 are treated like 301s and 404 like 302s. It makes sense that 404s are more often recrawled by Google because Google expects that the 404 turns into a 200 or 410 again."

 

Summary

Although it was a challenging experiment to run, the unforeseen obstacles that occurred will help us better plan our methodologies for future experiments. 

A 49.6% difference in crawl frequency between the 404 URLs and the 410 ones is both interesting and statistically significant, suggesting that if you want Google to recrawl a removed URL as infrequently as possible, you should return a 410 response code.