Crawl a sitemap
This example downloads and crawls the URLs from a sitemap.
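To make the "download the URLs" step concrete, here is a minimal sketch of pulling the `<loc>` entries out of sitemap XML. It is an illustration of the idea only and is not how `downloadListOfUrls` is implemented; the helper name `extractSitemapUrls` and the sample XML are hypothetical.

```javascript
// Hypothetical helper, for illustration only; not part of crawlee or apify.
function extractSitemapUrls(xml) {
    const urls = [];
    // Match the text content of every <loc>...</loc> element.
    const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
    for (const match of xml.matchAll(locPattern)) {
        // Sitemaps escape "&" as "&amp;", so undo that before using the URL.
        urls.push(match[1].replace(/&amp;/g, '&'));
    }
    return urls;
}

// Made-up sample sitemap used only to demonstrate the helper.
const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/search?q=a&amp;page=2</loc></url>
</urlset>`;

console.log(extractSitemapUrls(sampleSitemap));
```

In the examples on this page, `downloadListOfUrls` takes care of both the download and the extraction, so a helper like this is not needed there.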
Using CheerioCrawler:
import { Actor } from 'apify';
import { CheerioCrawler, downloadListOfUrls } from 'crawlee';
await Actor.init();
const crawler = new CheerioCrawler({
    // Function called for each URL
    async requestHandler({ request }) {
        console.log(request.url);
    },
    // Limit the crawl to 10 requests (remove this option to crawl the entire sitemap)
    maxRequestsPerCrawl: 10,
});
// Download the sitemap and extract the URLs it lists
const listOfUrls = await downloadListOfUrls({ url: 'https://apify.com/sitemap.xml' });
// Run the crawler
await crawler.run(listOfUrls);
await Actor.exit();
Using PuppeteerCrawler:
Tip: To run this example on the Apify Platform, select the apify/actor-node-puppeteer-chrome image for your Dockerfile.
import { Actor } from 'apify';
import { PuppeteerCrawler, downloadListOfUrls } from 'crawlee';
await Actor.init();
const crawler = new PuppeteerCrawler({
    // Function called for each URL
    async requestHandler({ request }) {
        console.log(request.url);
    },
    // Limit the crawl to 10 requests (remove this option to crawl the entire sitemap)
    maxRequestsPerCrawl: 10,
});
// Download the sitemap and extract the URLs it lists
const listOfUrls = await downloadListOfUrls({ url: 'https://apify.com/sitemap.xml' });
// Run the crawler
await crawler.run(listOfUrls);
await Actor.exit();
Using PlaywrightCrawler:
Tip: To run this example on the Apify Platform, select the apify/actor-node-playwright-chrome image for your Dockerfile.
import { Actor } from 'apify';
import { PlaywrightCrawler, downloadListOfUrls } from 'crawlee';
await Actor.init();
const crawler = new PlaywrightCrawler({
    // Function called for each URL
    async requestHandler({ request }) {
        console.log(request.url);
    },
    // Limit the crawl to 10 requests (remove this option to crawl the entire sitemap)
    maxRequestsPerCrawl: 10,
});
// Download the sitemap and extract the URLs it lists
const listOfUrls = await downloadListOfUrls({ url: 'https://apify.com/sitemap.xml' });
// Run the crawler
await crawler.run(listOfUrls);
await Actor.exit();
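In all three variants, the list returned by `downloadListOfUrls` is a plain array of URL strings, so it can be narrowed down before it is handed to `crawler.run()`. The following is a minimal sketch of one way to do that, assuming you only want unique URLs under a given path; `selectUrls`, the `/blog/` prefix, and the sample URLs are hypothetical and not part of crawlee or apify.

```javascript
// Hypothetical helper: keep unique URLs whose path starts with a given prefix.
function selectUrls(urls, pathPrefix) {
    const selected = new Set();
    for (const url of urls) {
        let parsed;
        try {
            parsed = new URL(url);
        } catch {
            continue; // Skip entries that are not valid absolute URLs.
        }
        if (parsed.pathname.startsWith(pathPrefix)) {
            selected.add(parsed.href);
        }
    }
    return [...selected];
}

// Made-up input standing in for the array returned by downloadListOfUrls.
const sampleUrls = [
    'https://example.com/blog/a',
    'https://example.com/blog/a',
    'https://example.com/pricing',
    'not a url',
];

console.log(selectUrls(sampleUrls, '/blog/'));
```

The filtered array could then be passed to `crawler.run()` in place of the full list, which keeps the crawl focused without relying on `maxRequestsPerCrawl` to cut it short.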