E-commerce optimization: make better use of the crawl budget

Optimizing an e-commerce site by working on the robots.txt file to make the most of the crawl budget: this, in a nutshell, is the work I want to present in this post.

When I do SEO for e-commerce, there is one activity that gives me particular satisfaction: configuring the robots.txt file to optimize the crawl budget. It is an activity that serves both shops with thousands of pages and new ones that have few pages but also a still small crawl budget.


The problem with e-commerce sites

One of the big problems of e-commerce sites is the automatic and uncontrolled generation of URLs caused by filters, wishlists, "send to a friend" links and so on: all excellent tools for the user, but with the flaw of generating a virtually infinite number of useless and duplicate pages.

The consequence is that search engines end up crawling thousands of pages that are useless for ranking purposes: they offer no added value to users looking for a product and only serve the visitor who actively decides to use a filter or a wishlist.

The final outcome is wasted crawl budget, which can prevent the crawlers from reaching the shop's important pages, such as products and categories, with the result that perhaps only 1,000 pages get indexed out of a catalog of 5,000 products.
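To give a purely hypothetical example (the numbers and parameter names are invented, not taken from the site discussed here): a single category such as /shoes/ combined with color, size and sort filters can produce URLs like /shoes/?color=red&size=42&sort=price_asc. With 10 colors, 15 sizes and 4 sort orders, that one category alone yields 10 × 15 × 4 = 600 crawlable URL combinations, all of which are essentially the same page.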


SEO for e-commerce: configure the robots.txt file

The attached image shows the response codes detected by Screaming Frog in a crawl that simulates Googlebot, before and after working on the robots.txt file.

In the left part of the image the red zone dominates: hundreds of thousands of pages with no response code. These are precisely the URLs generated automatically by the countless combinations of filters, an enormous quantity compared to the small green portion that represents the real, relevant content (pages and images).

The pie chart clearly shows the seriousness of the problem: 90% of the pages crawled are useless for ranking purposes.

The second chart, dominated by the green zone, shows the same site analyzed after the robots.txt configuration, with a very different outcome: 99% of the crawled pages return a 200 OK status code.
The work done allowed us to:

  • reduce the number of crawled URLs
  • concentrate the work of the crawlers on the important pages
  • make better use of the available crawl budget.

Bear in mind that a robots.txt file with some basic rules was already active during the first crawl, so we were not exactly starting from zero!
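The exact directives are not reproduced here and always depend on the platform and on how the shop builds its URLs, but just to give an idea, a fragment aimed at this kind of cleanup might look like the following sketch (the parameter and path names are invented, and the * wildcard used here is supported by Googlebot):

  User-agent: *
  # Block faceted-navigation URLs generated by filters (hypothetical parameters)
  Disallow: /*?color=
  Disallow: /*&color=
  Disallow: /*?size=
  Disallow: /*&size=
  Disallow: /*?sort=
  Disallow: /*&sort=
  # Block wishlist and "send to a friend" pages (hypothetical paths)
  Disallow: /wishlist/
  Disallow: /sendfriend/

Category and product URLs are left fully crawlable, so the bots can spend the available budget where it actually matters.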

Another interesting figure is the time needed to complete the crawl, performed with the same computer and the same connection:

  • 18 hours in the first case
  • 3 hours in the second case

roughly 83% less time (from 18 hours down to 3) for the same website!