Crawling sitemaps
In the previous lesson, we learned about the utility (and dangers) of crawling sitemaps. In this lesson, we will go in depth on how to crawl them.
We will look at the following topics:
- How to find sitemap URLs
- How to set up HTTP requests to download sitemaps
- How to parse URLs from sitemaps
- Using Crawlee to get all URLs in a few lines of code
How to find sitemap URLs
Sitemaps are commonly restricted to a maximum of 50,000 URLs, so larger websites usually have a whole list of them. There can be a master sitemap (a sitemap index) containing the URLs of all the other sitemaps, or the sitemaps might simply be referenced in `robots.txt` and/or have auto-incremented URLs like `/sitemap1.xml`, `/sitemap2.xml`, etc.
Google
You can try your luck on Google by searching for `site:example.com sitemap.xml` or `site:example.com sitemap.xml.gz` and see if you get any results. If you do, you can try to download the sitemap and see if it contains any useful URLs. The success of this approach depends on the website telling Google to index the sitemap file itself, which is rather uncommon.
robots.txt
If the website has a `robots.txt` file, it often contains sitemap URLs, usually listed under the `Sitemap:` directive.
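For illustration, a website's `robots.txt` could reference its sitemaps like this (a hypothetical excerpt; the actual paths differ per website):

```text
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap_index.xml
Sitemap: https://example.com/sitemap-products.xml.gz
```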
Common URL paths
You can check some common URL paths, such as the following:
- `/sitemap.xml`
- `/product_index.xml`
- `/product_template.xml`
- `/sitemap_index.xml`
- `/sitemaps/sitemap_index.xml`
- `/sitemap/product_index.xml`
- `/media/sitemap.xml`
- `/media/sitemap/sitemap.xml`
- `/media/sitemap/index.xml`
Also make sure you test the list with `.gz`, `.tar.gz`, and `.tgz` extensions and by capitalizing the words (e.g. `/Sitemap_index.xml.tar.gz`).
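If you want to automate this check, here is a minimal sketch (the base URL and candidate paths are illustrative assumptions; some servers don't handle HEAD requests well, in which case a GET request works too):

```javascript
// A minimal sketch: probe a few common sitemap paths and report which exist.
// BASE_URL and CANDIDATE_PATHS are hypothetical examples; extend the list as needed.
const BASE_URL = 'https://example.com';
const CANDIDATE_PATHS = ['/sitemap.xml', '/sitemap_index.xml', '/sitemap.xml.gz'];

for (const path of CANDIDATE_PATHS) {
    const response = await fetch(`${BASE_URL}${path}`, { method: 'HEAD' });
    if (response.ok) {
        console.log(`Possible sitemap found: ${BASE_URL}${path}`);
    }
}
```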
Some websites also provide an HTML version of the sitemap to help indexing bots find new content. Common paths include:
- `/sitemap`
- `/category-sitemap`
- `/sitemap.html`
- `/sitemap_index`
Apify provides the Sitemap Sniffer, an open-source actor that scans these URL variations automatically so that you don't have to check them manually.
How to set up HTTP requests to download sitemaps
For most sitemaps, you can make a single HTTP request and parse the downloaded XML text. Some sitemaps are compressed and have to be streamed and decompressed. The code can get fairly complicated, but scraping frameworks, such as Crawlee, can do this out of the box.
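As a minimal sketch, assuming a gzipped sitemap at a hypothetical URL, downloading and decompressing it in Node.js could look like this:

```javascript
import { gunzipSync } from 'node:zlib';

// Hypothetical URL; an uncompressed sitemap could be read with response.text() directly.
const response = await fetch('https://example.com/sitemap.xml.gz');
const compressed = Buffer.from(await response.arrayBuffer());
const xml = gunzipSync(compressed).toString('utf-8');
```

Note that this buffers the whole file in memory; very large sitemaps are better streamed and decompressed incrementally, which is exactly the complexity that frameworks handle for you.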
How to parse URLs from sitemaps
Use your favorite XML parser to extract the URLs from inside the `<loc>` tags. Just be careful: the sitemap might contain URLs that you don't want to crawl (e.g. `/about`, `/contact`, or various special category sections). For specific code examples, see our Node.js guide.
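For illustration, here is a minimal sketch using Cheerio in XML mode (the choice of Cheerio is an assumption, any XML parser will do, and the `xml` variable is assumed to hold the sitemap text downloaded in the previous step):

```javascript
import * as cheerio from 'cheerio';

// Parse the sitemap XML and collect the text of every <loc> tag.
const $ = cheerio.load(xml, { xmlMode: true });
const urls = $('loc')
    .map((_, el) => $(el).text().trim())
    .get()
    // Illustrative filter: skip URLs we don't want to crawl.
    .filter((url) => !url.endsWith('/about') && !url.endsWith('/contact'));
```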
Using Crawlee
Fortunately, you don't have to worry about any of the above steps if you use Crawlee, a scraping framework with rich sitemap traversing and parsing support. It can traverse nested sitemaps, download and parse compressed sitemaps, and extract URLs from them. You can get all the URLs in a few lines of code:
```javascript
import { RobotsFile } from 'crawlee';

const robots = await RobotsFile.find('https://www.mysite.com');
const allWebsiteUrls = await robots.parseUrlsFromSitemaps();
```
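Here, `RobotsFile.find()` locates and parses the website's `robots.txt` file, and `parseUrlsFromSitemaps()` then downloads the sitemaps referenced there and returns all the URLs they contain.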
Next up
That's all we need to know about sitemaps for now. Let's dive into a much more interesting topic: search, filters, and pagination.