How to scrape from sitemaps

Processing sitemaps automatically with Crawlee

Crawlee allows you to scrape sitemaps with ease. If you are using Crawlee, you can skip the following steps and just gather all the URLs from the sitemap in a few lines of code.

import { RobotsFile } from 'crawlee';

const robots = await RobotsFile.find('https://www.mysite.com');

const allWebsiteUrls = await robots.parseUrlsFromSitemaps();
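
The gathered list covers every URL the sitemaps mention, so in practice you will usually narrow it down before crawling. Below is a minimal sketch of that step, assuming the result is an array of URL strings. The `filterByPathPrefix` helper, the `/products/` prefix, and the sample URLs are hypothetical and not part of Crawlee.

```javascript
// Hypothetical helper (not part of Crawlee): keep only URLs whose path
// starts with the given prefix. Malformed entries are skipped.
function filterByPathPrefix(urls, prefix) {
    return urls.filter((url) => {
        try {
            return new URL(url).pathname.startsWith(prefix);
        } catch {
            return false; // not a valid absolute URL
        }
    });
}

// Sample data standing in for the result of robots.parseUrlsFromSitemaps().
const sampleUrls = [
    'https://www.mysite.com/products/red-ale',
    'https://www.mysite.com/contact',
    'not a url',
];

const productUrls = filterByPathPrefix(sampleUrls, '/products/');
console.log(productUrls);
```

Comparing the parsed `pathname` rather than the raw string keeps the check independent of protocol and host.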

The sitemap.xml file is a jackpot for every web scraper developer. Take advantage of this and learn an easier way to extract data from websites using Crawlee.

Let's say we want to scrape a database of craft beers (brewbound.com) before summer starts. If we are lucky, the website will contain a sitemap at brewbound.com/sitemap.xml.

Check out Sitemap Sniffer, which can discover sitemaps in hidden locations!

Analyzing the sitemap

The sitemap is usually located at the path /sitemap.xml. It is always worth trying that URL, as it is rarely linked anywhere on the site. It usually contains a list of all pages in XML format.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>http://www.brewbound.com/advertise</loc>
        <lastmod>2015-03-19</lastmod>
        <changefreq>daily</changefreq>
    </url>
    <url>
    ...
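
To get a feel for what the file contains, the `<loc>` entries can be pulled out of the XML text directly. The sketch below does this with a regular expression on a small sample in the same shape as the sitemap above. The `extractLocs` helper and the second sample URL are made up for illustration, and a real XML parser would be more robust for anything beyond a quick look.

```javascript
// Sample sitemap fragment; the second <url> entry is invented.
const sitemapXml = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>http://www.brewbound.com/advertise</loc>
        <lastmod>2015-03-19</lastmod>
        <changefreq>daily</changefreq>
    </url>
    <url>
        <loc>http://www.brewbound.com/breweries/example-brewery</loc>
    </url>
</urlset>`;

// Hypothetical helper: collect the text of every <loc> element.
function extractLocs(xml) {
    const locs = [];
    for (const match of xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)) {
        locs.push(match[1]);
    }
    return locs;
}

const pageUrls = extractLocs(sitemapXml);
console.log(pageUrls);
```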

The URLs of breweries take this form:

http://www.brewbound.com/breweries/[BREWERY_NAME]

And the URLs of craft beers look like this:

http://www.brewbound.com/breweries/[BREWERY_NAME]/[BEER_NAME]

They can be matched using the following regular expression:

http(s)?:\/\/www\.brewbound\.com\/breweries\/[^\/<]+\/[^\/<]+

Note that both [^\/<] parts of the regular expression exclude the < symbol. Without it, a match could run past the end of the URL into the closing </loc> tag, and brewery pages would be matched along with beer pages.
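
The effect of the expression can be checked on a few sample lines before pointing it at the real sitemap. This is only a sketch: the brewery and beer names below are invented, and the pattern is written as a JavaScript regex literal with the same `gm` flags used later in this article.

```javascript
// Both path segments exclude "/" and "<", so a match stops before </loc>.
const beerUrlRegex = /http(s)?:\/\/www\.brewbound\.com\/breweries\/[^/<]+\/[^/<]+/gm;

// Sample <loc> lines; the brewery and beer names are invented.
const sampleXml = [
    '<loc>http://www.brewbound.com/advertise</loc>',
    '<loc>http://www.brewbound.com/breweries/example-brewery</loc>',
    '<loc>http://www.brewbound.com/breweries/example-brewery/example-ale</loc>',
].join('\n');

// With the g flag, match() returns every full match (or null if none).
const beerUrls = sampleXml.match(beerUrlRegex) ?? [];
console.log(beerUrls);
```

Only the third line matches: the advertise page is outside /breweries/, and the brewery page has a single path segment after it.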

Scraping the sitemap in Crawlee

If you're scraping sitemaps (or anything else, really), Crawlee is perfect for the job.

First, let's add the beer URLs from the sitemap to a RequestList, using our regular expression to match only the craft beer URLs and skip brewery pages, the contact page, and so on.

const requestList = await RequestList.open(null, [{
    requestsFromUrl: 'https://www.brewbound.com/sitemap.xml',
    regex: /http(s)?:\/\/www\.brewbound\.com\/breweries\/[^/<]+\/[^/<]+/gm,
}]);

Now, let's use PuppeteerCrawler to open each URL from the RequestList with Puppeteer and push the extracted data to the final dataset.

const crawler = new PuppeteerCrawler({
    requestList,
    async requestHandler({ page }) {
        const beerPage = await page.evaluate(() => {
            return document.getElementsByClassName('productreviews').length;
        });
        if (!beerPage) return;

        const data = await page.evaluate(() => {
            const title = document.getElementsByTagName('h1')[0].innerText;
            const [brewery, beer] = title.split(':');
            const description = document.getElementsByClassName('productreviews')[0].innerText;

            return { brewery, beer, description };
        });

        await Dataset.pushData(data);
    },
});
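
The title handling in the second page.evaluate call assumes the heading has the form "Brewery: Beer". Below is a hedged sketch of that step as a standalone function, assuming that format, since no real page title is shown here. `parseBeerTitle` is a hypothetical helper. Unlike the handler above, it trims whitespace, splits only at the first colon, and tolerates a heading with no colon.

```javascript
// Hypothetical helper: split an "<h1>" text of the assumed form
// "Brewery: Beer" into its two parts.
function parseBeerTitle(title) {
    const separatorIndex = title.indexOf(':');
    if (separatorIndex === -1) {
        // No colon: treat the whole heading as the brewery name.
        return { brewery: title.trim(), beer: null };
    }
    return {
        brewery: title.slice(0, separatorIndex).trim(),
        beer: title.slice(separatorIndex + 1).trim(),
    };
}

// Invented sample heading.
console.log(parseBeerTitle('Example Brewery: Example Ale'));
```

Code passed to page.evaluate runs in the browser, so a helper like this would either have to be defined inside that callback or be applied in Node.js to the raw title returned from the page.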

Full code

If we create a new Actor on the Apify platform using the code below, it returns a nicely formatted spreadsheet listing the breweries, their beers, and each beer's description.

Make sure to use the apify/actor-node-puppeteer-chrome image in your Dockerfile; otherwise, the run will fail.

import { Dataset, PuppeteerCrawler, RequestList } from 'crawlee';

const requestList = await RequestList.open(null, [{
    requestsFromUrl: 'https://www.brewbound.com/sitemap.xml',
    regex: /http(s)?:\/\/www\.brewbound\.com\/breweries\/[^/<]+\/[^/<]+/gm,
}]);

const crawler = new PuppeteerCrawler({
    requestList,
    async requestHandler({ page }) {
        const beerPage = await page.evaluate(() => {
            return document.getElementsByClassName('productreviews').length;
        });
        if (!beerPage) return;

        const data = await page.evaluate(() => {
            const title = document.getElementsByTagName('h1')[0].innerText;
            const [brewery, beer] = title.split(':');
            const description = document.getElementsByClassName('productreviews')[0].innerText;

            return { brewery, beer, description };
        });

        await Dataset.pushData(data);
    },
});

await crawler.run();