BasicCrawler
Provides a simple framework for parallel crawling of web pages. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs, enabling recursive crawling of websites.
BasicCrawler is a low-level tool that requires the user to implement the page download and data extraction functionality themselves. If you want a crawler that already facilitates this functionality, please consider using CheerioCrawler, PuppeteerCrawler or PlaywrightCrawler.
BasicCrawler invokes the user-provided BasicCrawlerOptions.handleRequestFunction for each Request object, which represents a single URL to crawl. The Request objects are fed from the RequestList or RequestQueue instances provided by the BasicCrawlerOptions.requestList or BasicCrawlerOptions.requestQueue constructor options, respectively.
If both the BasicCrawlerOptions.requestList and BasicCrawlerOptions.requestQueue options are used, the instance first processes URLs from the RequestList and automatically enqueues all of them to the RequestQueue before it starts their processing. This ensures that a single URL is not crawled multiple times. The crawler finishes if there are no more Request objects to crawl.
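For illustration, a minimal sketch of a crawler that combines both sources might look like the following (the URLs, the 'start-urls' list name and the logging are placeholder assumptions, not part of this reference):
const Apify = require('apify');

Apify.main(async () => {
    const requestList = await Apify.openRequestList('start-urls', [
        { url: 'http://www.example.com/page-1' },
        { url: 'http://www.example.com/page-2' },
    ]);
    const requestQueue = await Apify.openRequestQueue();

    const crawler = new Apify.BasicCrawler({
        requestList,
        requestQueue,
        handleRequestFunction: async ({ request }) => {
            // URLs from the list are enqueued to the queue first,
            // so each URL is processed only once
            console.log(`Processing ${request.url}`);
            // Newly discovered URLs can be added to the queue at runtime
            await requestQueue.addRequest({ url: 'http://www.example.com/page-3' });
        },
    });

    await crawler.run();
});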
New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the AutoscaledPool class. All AutoscaledPool configuration options can be passed to the autoscaledPoolOptions parameter of the BasicCrawler constructor. For user convenience, the minConcurrency and maxConcurrency AutoscaledPool options are available directly in the BasicCrawler constructor.
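As a sketch, the concurrency settings could be configured as shown below; the concrete numbers are arbitrary assumptions, and desiredConcurrency is taken from the AutoscaledPool options rather than this page:
const Apify = require('apify');

Apify.main(async () => {
    const requestList = await Apify.openRequestList('urls', [
        { url: 'http://www.example.com' },
    ]);

    const crawler = new Apify.BasicCrawler({
        requestList,
        handleRequestFunction: async ({ request }) => {
            console.log(`Crawling ${request.url}`);
        },
        // Convenience shortcuts for the corresponding AutoscaledPool options
        minConcurrency: 5,
        maxConcurrency: 50,
        // Other AutoscaledPool options are passed through unchanged
        autoscaledPoolOptions: {
            desiredConcurrency: 10,
        },
    });

    await crawler.run();
});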
Example usage:
const Apify = require('apify');

Apify.main(async () => {
    // Prepare a list of URLs to crawl
    const requestList = new Apify.RequestList({
        sources: [
            { url: 'http://www.example.com/page-1' },
            { url: 'http://www.example.com/page-2' },
        ],
    });
    await requestList.initialize();

    // Crawl the URLs
    const crawler = new Apify.BasicCrawler({
        requestList,
        handleRequestFunction: async ({ request }) => {
            // 'request' contains an instance of the Request class
            // Here we simply fetch the HTML of the page and store it to a dataset
            const { body } = await Apify.utils.requestAsBrowser(request);
            await Apify.pushData({
                url: request.url,
                html: body,
            });
        },
    });

    await crawler.run();
});
Properties
stats
Type: Statistics
Contains statistics about the current run.
requestList
Type: RequestList
A reference to the underlying RequestList class that manages the crawler's Requests. Only available if used by the crawler.
requestQueue
Type: RequestQueue
A reference to the underlying RequestQueue class that manages the crawler's Requests. Only available if used by the crawler.
sessionPool
Type: SessionPool
A reference to the underlying SessionPool class that manages the crawler's Sessions. Only available if used by the crawler.
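For instance, a sketch that enables the pool and reads a session from it inside the handler; the useSessionPool and sessionPoolOptions constructor options and the maxPoolSize value are assumptions taken from the related SessionPool reference, not from this page:
const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'http://www.example.com' });

    const crawler = new Apify.BasicCrawler({
        requestQueue,
        // Enable the pool so that crawler.sessionPool gets created
        useSessionPool: true,
        sessionPoolOptions: { maxPoolSize: 25 },
        handleRequestFunction: async ({ request }) => {
            // The pool is available while the crawler is running
            const session = await crawler.sessionPool.getSession();
            console.log(`Using session ${session.id} for ${request.url}`);
        },
    });

    await crawler.run();
});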
autoscaledPool
Type: AutoscaledPool
A reference to the underlying AutoscaledPool class that manages the concurrency of the crawler. Note that this property is only initialized after calling the BasicCrawler.run() function. You can use it to change the concurrency settings on the fly, to pause the crawler by calling AutoscaledPool.pause() or to abort it by calling AutoscaledPool.abort().
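For example, a sketch that aborts the crawl from inside the handler once an arbitrary limit is reached (the limit and the counter are illustrative assumptions):
const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'http://www.example.com' });

    let processedCount = 0;
    const crawler = new Apify.BasicCrawler({
        requestQueue,
        handleRequestFunction: async ({ request }) => {
            processedCount++;
            // autoscaledPool is initialized once run() has started
            if (processedCount >= 100) {
                await crawler.autoscaledPool.abort();
            }
        },
    });

    await crawler.run();
});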
new BasicCrawler(options)
Parameters:
options: BasicCrawlerOptions - All BasicCrawler parameters are passed via an options object.
basicCrawler.log
basicCrawler.sessionPoolOptions
basicCrawler.run()
Runs the crawler. Returns a promise that resolves once all the requests are processed.
Returns:
Promise<void>