Basic crawler
This is the most bare-bones example of the Apify SDK, which demonstrates some of its building blocks such as the BasicCrawler. You probably don't need to go this deep though, and it would be better to start with one of the full-featured crawlers like CheerioCrawler or PlaywrightCrawler.
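For comparison, here is a rough sketch of how the same task could look with CheerioCrawler, which handles the HTTP requests and HTML parsing for you. The URL and stored fields below are illustrative and not part of the original example:

import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

const crawler = new CheerioCrawler({
    // The request handler receives an already-parsed Cheerio object ($)
    // along with the raw body, so no manual HTTP client is needed.
    async requestHandler({ request, $, body }) {
        console.log(`Processing ${request.url}...`);
        await Actor.pushData({
            url: request.url,
            title: $('title').text(),
            html: body.toString(),
        });
    },
});

await crawler.run([{ url: 'http://www.example.com/' }]);

await Actor.exit();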
The script simply downloads several web pages with plain HTTP requests using the got-scraping npm package and stores their raw HTML and URL in the default dataset. In local configuration, the data will be stored as JSON files in ./storage/datasets/default.
import { Actor } from 'apify';
import { BasicCrawler } from 'crawlee';
import { gotScraping } from 'got-scraping';
await Actor.init();
// Create a BasicCrawler - the simplest crawler that enables
// users to implement the crawling logic themselves.
const crawler = new BasicCrawler({
    // This function will be called for each URL to crawl.
    async requestHandler({ request }) {
        const { url } = request;
        console.log(`Processing ${url}...`);

        // Fetch the page HTML with a plain HTTP request via got-scraping.
        const { body } = await gotScraping({ url });

        // Store the HTML and URL to the default dataset.
        await Actor.pushData({
            url: request.url,
            html: body,
        });
    },
});
// The initial list of URLs to crawl. Here we use just a few hard-coded URLs.
await crawler.run([
    { url: 'http://www.google.com/' },
    { url: 'http://www.example.com/' },
    { url: 'http://www.bing.com/' },
    { url: 'http://www.wikipedia.com/' },
]);
console.log('Crawler finished.');
await Actor.exit();
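If you prefer to check the results programmatically rather than browsing the JSON files, the records can be read back from the default dataset along these lines (a minimal sketch, assuming the crawl above has already populated the dataset):

import { Actor } from 'apify';

await Actor.init();

// Open the default dataset and list the first few stored records.
const dataset = await Actor.openDataset();
const { items } = await dataset.getData({ limit: 10 });

for (const { url, html } of items) {
    console.log(`${url}: ${html.length} characters of HTML`);
}

await Actor.exit();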