Crawl a sitemap
This example downloads and crawls the URLs from a sitemap.
- CheerioCrawler
- PuppeteerCrawler
- PlaywrightCrawler
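A sitemap is an XML file in which each `<loc>` element holds one absolute page URL. `downloadListOfUrls()` handles the download and URL extraction for you; the sketch below only illustrates the underlying format with a regex-based extraction (an assumption for illustration, not Crawlee's actual implementation):

```javascript
// A minimal sketch of extracting URLs from sitemap XML.
// Crawlee's downloadListOfUrls() does this (and more) for you;
// this only illustrates the sitemap format.
const sitemapXml = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>https://example.com/</loc></url>
    <url><loc>https://example.com/about</loc></url>
</urlset>`;

// Each <loc> element contains one absolute URL.
const urls = [...sitemapXml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)]
    .map((match) => match[1]);

console.log(urls);
// [ 'https://example.com/', 'https://example.com/about' ]
```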
Using CheerioCrawler:
```javascript
import { Actor } from 'apify';
import { CheerioCrawler, downloadListOfUrls } from 'crawlee';

await Actor.init();

const crawler = new CheerioCrawler({
    // Function called for each URL
    async requestHandler({ request }) {
        console.log(request.url);
    },
    // Limit the crawl to 10 requests (remove this option to crawl the whole sitemap)
    maxRequestsPerCrawl: 10,
});

const listOfUrls = await downloadListOfUrls({ url: 'https://apify.com/sitemap.xml' });

// Run the crawler
await crawler.run(listOfUrls);

await Actor.exit();
```
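Because `downloadListOfUrls()` resolves to a plain array of URL strings, you can filter or transform the list before passing it to `crawler.run()`. A minimal sketch (the sample URLs and the `/blog` path prefix are assumptions for illustration):

```javascript
// Stand-in for the array resolved by downloadListOfUrls().
const listOfUrls = [
    'https://example.com/',
    'https://example.com/blog/post-1',
    'https://example.com/blog/post-2',
    'https://example.com/contact',
];

// Keep only URLs under an illustrative /blog path.
const blogUrls = listOfUrls.filter(
    (url) => new URL(url).pathname.startsWith('/blog'),
);

console.log(blogUrls.length); // 2
// Then crawl only the filtered subset: await crawler.run(blogUrls);
```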
Using PuppeteerCrawler:
Tip: To run this example on the Apify Platform, select the `apify/actor-node-puppeteer-chrome` image for your Dockerfile.
```javascript
import { Actor } from 'apify';
import { PuppeteerCrawler, downloadListOfUrls } from 'crawlee';

await Actor.init();

const crawler = new PuppeteerCrawler({
    // Function called for each URL
    async requestHandler({ request }) {
        console.log(request.url);
    },
    // Limit the crawl to 10 requests (remove this option to crawl the whole sitemap)
    maxRequestsPerCrawl: 10,
});

const listOfUrls = await downloadListOfUrls({ url: 'https://apify.com/sitemap.xml' });

// Run the crawler
await crawler.run(listOfUrls);

await Actor.exit();
```
Using PlaywrightCrawler:
Tip: To run this example on the Apify Platform, select the `apify/actor-node-playwright-chrome` image for your Dockerfile.
```javascript
import { Actor } from 'apify';
import { PlaywrightCrawler, downloadListOfUrls } from 'crawlee';

await Actor.init();

const crawler = new PlaywrightCrawler({
    // Function called for each URL
    async requestHandler({ request }) {
        console.log(request.url);
    },
    // Limit the crawl to 10 requests (remove this option to crawl the whole sitemap)
    maxRequestsPerCrawl: 10,
});

const listOfUrls = await downloadListOfUrls({ url: 'https://apify.com/sitemap.xml' });

// Run the crawler
await crawler.run(listOfUrls);

await Actor.exit();
```
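Large sites often publish a sitemap index: a sitemap whose entries point to further child sitemaps rather than pages. In that case the downloaded list may itself contain sitemap URLs that need another download pass. A minimal sketch of separating the two (the sample URLs and the file-extension check are assumptions for illustration):

```javascript
// Stand-in for a list downloaded from a sitemap index.
const listOfUrls = [
    'https://example.com/sitemap-pages.xml',
    'https://example.com/sitemap-blog.xml.gz',
    'https://example.com/pricing',
];

// Heuristic: treat .xml / .xml.gz paths as child sitemaps.
const isSitemap = (url) => /\.xml(\.gz)?$/.test(new URL(url).pathname);

const childSitemaps = listOfUrls.filter(isSitemap);
const pageUrls = listOfUrls.filter((url) => !isSitemap(url));

console.log(childSitemaps.length, pageUrls.length); // 2 1
// Download each child sitemap with downloadListOfUrls() as well,
// then crawl the combined page URLs with crawler.run().
```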