Scrape websites using the sitemap

The sitemap.xml file is a jackpot for every web scraper. Take advantage of this and learn a much easier way to extract data from websites using the Apify SDK.

Let's say we want to scrape a database of craft beers (brewbound.com) before the summer season starts. If we are lucky, the website will contain a sitemap at https://www.brewbound.com/sitemap.xml.

Check out our Sitemap Sniffer tool, which can discover sitemaps in hidden locations.

The sitemap

The sitemap is usually located at the path /sitemap.xml. It is always worth trying that URL, as it is rarely linked anywhere on the site. It usually contains a list of all pages in XML format.
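Since the conventional location is just a path on the site's origin, we can derive the sitemap URL from any page URL. A minimal sketch (this only covers the default /sitemap.xml path; sites may also declare sitemaps in robots.txt or elsewhere):

```javascript
// Derive the conventional sitemap location from any page URL.
// The URL constructor resolves the path against the site's origin.
const pageUrl = 'https://www.brewbound.com/breweries/some-brewery';
const sitemapUrl = new URL('/sitemap.xml', pageUrl).href;
console.log(sitemapUrl); // https://www.brewbound.com/sitemap.xml
```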

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>http://www.brewbound.com/advertise</loc>
        <lastmod>2015-03-19</lastmod>
        <changefreq>daily</changefreq>
    </url>
    <url>
    ...

The URLs of breweries are in the form

http://www.brewbound.com/breweries/[BREWERY_NAME]

and the URLs of craft beers are in the form

http://www.brewbound.com/breweries/[BREWERY_NAME]/[BEER_NAME]

They can be matched with the following regular expression (regex).

/http(s)?:\/\/www\.brewbound\.com\/breweries\/[^\/<]+\/[^\/<]+/gm

Note the two parts of the regular expression, [^\/<], containing <. We exclude the < character so that a match stops before the </loc> tag which closes each URL; otherwise a match could run past the end of one URL into the next line.
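We can sanity-check the pattern (written with [^\/<] in both path segments, as described above) against a small sitemap fragment. The sample URLs below are illustrative, not real brewbound.com data:

```javascript
// A sitemap fragment with one advertise page, one brewery page,
// and one brewery/beer page (illustrative sample data).
const sitemapFragment = [
    '<loc>http://www.brewbound.com/advertise</loc>',
    '<loc>http://www.brewbound.com/breweries/sixpoint-brewery</loc>',
    '<loc>http://www.brewbound.com/breweries/sixpoint-brewery/resin</loc>',
].join('\n');

const beerUrlRegex = /http(s)?:\/\/www\.brewbound\.com\/breweries\/[^\/<]+\/[^\/<]+/gm;

// Only the brewery/beer URL matches; the advertise page and the
// brewery-only URL are filtered out.
const matches = sitemapFragment.match(beerUrlRegex);
console.log(matches);
// [ 'http://www.brewbound.com/breweries/sixpoint-brewery/resin' ]
```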

Using the sitemap in Apify SDK

Our web scraping and automation library is well-suited for scraping with sitemaps.

First, let's import the beer URLs from the sitemap into a RequestList, using our regular expression to match only the (craft!) beer URLs, not brewery pages, the contact page, and so on.

const requestList = new Apify.RequestList({
    sources: [{
        requestsFromUrl: 'https://www.brewbound.com/sitemap.xml',
        regex: /http(s)?:\/\/www\.brewbound\.com\/breweries\/[^\/<]+\/[^\/<]+/gm,
    }],
});

await requestList.initialize();

Now, let's use PuppeteerCrawler to scrape the created RequestList with Puppeteer and push the results to the default dataset.

const crawler = new Apify.PuppeteerCrawler({
    requestList,
    handlePageFunction: async ({ page, request }) => {
        const data = await page.evaluate(() => {
            const title = document.getElementsByTagName('h1')[1].innerText;
            const [brewery, beer] = title.split(':');
            const description = document.getElementsByClassName('productreviews')[0].innerText;

            return { brewery, beer, description };
        });

        await Apify.pushData(data);
    },
});

await crawler.run();
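The handlePageFunction above assumes the page heading has the form "Brewery: Beer", so a single split on the colon separates the two names. A quick sketch of that parsing step (the sample title is illustrative):

```javascript
// The beer detail page heading is assumed to look like
// "Sixpoint Brewery: Resin" - brewery and beer separated by a colon.
const title = 'Sixpoint Brewery: Resin';

// Trimming each part avoids a leading space in the beer name,
// since split(':') keeps the whitespace after the colon.
const [brewery, beer] = title.split(':').map((part) => part.trim());
console.log(brewery); // Sixpoint Brewery
console.log(beer);    // Resin
```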

Full code example

If we create a new actor using the code below on the Apify platform, it returns a nicely formatted spreadsheet containing a list of breweries with their beers and descriptions.

Make sure to select the Node.js 12 + Chrome on Debian (apify/actor-node-chrome) base image, otherwise the run will fail.

const Apify = require('apify');

Apify.main(async () => {
    const requestList = new Apify.RequestList({
        sources: [{
            requestsFromUrl: 'https://www.brewbound.com/sitemap.xml',
            regex: /http(s)?:\/\/www\.brewbound\.com\/breweries\/[^\/<]+\/[^\/<]+/gm,
        }],
    });

    await requestList.initialize();

    const crawler = new Apify.PuppeteerCrawler({
        requestList,
        handlePageFunction: async ({ page, request }) => {
            const data = await page.evaluate(() => {
                const title = document.getElementsByTagName('h1')[1].innerText;
                const [brewery, beer] = title.split(':');
                const description = document.getElementsByClassName('productreviews')[0].innerText;

                return { brewery, beer, description };
            });

            await Apify.pushData(data);
        },
    });

    await crawler.run();
});