PuppeteerCrawlerOptions
Properties
handlePageFunction
Type: PuppeteerHandlePage
Function that is called to process each request. It is passed an object with the following fields:
{
    request: Request,
    response: Response,
    page: Page,
    session: Session,
    browserController: BrowserController,
    proxyInfo: ProxyInfo,
    crawler: PuppeteerCrawler,
}
request is an instance of the Request object with details about the URL to open, HTTP method, etc. page is an instance of the Puppeteer Page, browserPool is an instance of the BrowserPool, browserController is an instance of the BrowserController, and response is an instance of the Puppeteer Response, which is the main resource response as returned by page.goto(request.url). The function must return a promise, which is then awaited by the crawler.
If the function throws an exception, the crawler will try to re-crawl the request later, up to maxRequestRetries times. If all the retries fail, the crawler calls the function provided in the handleFailedRequestFunction option. For this to work, always let your function throw exceptions rather than catch them. The exceptions are logged to the request using the Request.pushErrorMessage() function.
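As a rough illustration of the contract described above, the following sketch shows a handler that lets failures propagate so the crawler's retry logic can take over. The `results` array and the HTTP status check are assumptions made for this example only; they are not part of the crawler's API.

```javascript
// Hypothetical in-memory store, used only for this example.
const results = [];

// Sketch of a handlePageFunction. Being async, it always returns a promise.
async function handlePageFunction({ request, response, page }) {
    // Throw instead of swallowing the problem, so the crawler can retry.
    if (response && response.status() >= 400) {
        throw new Error(`Got HTTP ${response.status()} for ${request.url}`);
    }
    // page.title() is a standard Puppeteer Page method.
    const title = await page.title();
    results.push({ url: request.url, title });
}
```

The function would be passed as the handlePageFunction option when constructing the crawler.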
navigationTimeoutSecs
Type: number
= 60
Timeout in which page navigation needs to finish, in seconds.
handleFailedRequestFunction
Type: HandleFailedRequest
A function to handle requests that failed more than maxRequestRetries times.
The function receives the following object as an argument:
{
    request: Request,
    error: Error,
    response: Response,
    page: Page,
    session: Session,
    browserController: BrowserController,
    proxyInfo: ProxyInfo,
    crawler: PuppeteerCrawler,
}
Where the Request instance corresponds to the failed request, and the Error instance represents the last error thrown during processing of the request.
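A failure handler might look roughly like the sketch below. It assumes the last error is available as an `error` field on the argument, matching the Error instance mentioned above, and it reads `request.errorMessages`, where messages recorded by Request.pushErrorMessage() are kept. The shape of the `summary` object is purely illustrative.

```javascript
// Sketch of a handleFailedRequestFunction.
// Assumption: the argument carries the last error in an `error` field.
async function handleFailedRequestFunction({ request, error }) {
    const summary = {
        url: request.url,
        // Number of error messages recorded on the request so far.
        recordedErrors: (request.errorMessages || []).length,
        lastError: error ? error.message : null,
    };
    console.log(`Request failed permanently: ${JSON.stringify(summary)}`);
    return summary;
}
```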
launchContext
Type: PuppeteerLaunchContext
Options used by Apify.launchPuppeteer()
to start new Puppeteer instances.
handlePageTimeoutSecs
Type: number
= 60
Timeout in which the function passed as handlePageFunction
needs to finish, in seconds.
browserPoolOptions
Type: BrowserPoolOptions
Custom options passed to the underlying BrowserPool
constructor. You can tweak those to
fine-tune browser management.
persistCookiesPerSession
Type: boolean
= true
Automatically saves cookies to Session. Works only if Session Pool is used.
proxyConfiguration
Type: ProxyConfiguration
If set, PuppeteerCrawler will route all connections through Apify Proxy or through your own proxy URLs, rotated according to the configuration. For more information, see the documentation.
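One possible way to prepare this option with custom proxy URLs is sketched below. The `Apify` module is passed in as a parameter so that the snippet does not depend on how the SDK is loaded; the helper name `buildProxyOptions` and the proxy URLs are placeholders invented for the example.

```javascript
// `Apify` is injected; in a real actor it would come from require('apify').
async function buildProxyOptions(Apify) {
    const proxyConfiguration = await Apify.createProxyConfiguration({
        // Placeholder URLs; replace them with your own proxies.
        proxyUrls: [
            'http://proxy-1.example.com:8000',
            'http://proxy-2.example.com:8000',
        ],
    });
    // Meant to be spread into the PuppeteerCrawler options.
    return { proxyConfiguration };
}
```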
preNavigationHooks
Type: Array<PuppeteerHook>
Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies or browser properties before navigation.
Each function accepts two parameters, crawlingContext and gotoOptions. The gotoOptions object is then passed to the page.goto() function the crawler calls to navigate. Example:
preNavigationHooks: [
    async (crawlingContext, gotoOptions) => {
        const { page } = crawlingContext;
        await page.evaluate((attr) => { window.foo = attr; }, 'bar');
    },
]
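A hook can also adjust the navigation itself. The sketch below sets an extra HTTP header and changes `gotoOptions.waitUntil`; it assumes the crawler reads gotoOptions only after all hooks have run, so that the mutation takes effect. The header value is an arbitrary example.

```javascript
const preNavigationHooks = [
    async ({ page }, gotoOptions) => {
        // page.setExtraHTTPHeaders() is a standard Puppeteer Page method.
        await page.setExtraHTTPHeaders({ 'Accept-Language': 'en-US' });
        // Assumption: the mutated object is what page.goto() later receives.
        gotoOptions.waitUntil = 'networkidle2';
    },
];
```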
postNavigationHooks
Type: Array<PuppeteerHook>
Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful. The function accepts
crawlingContext
as the only parameter. Example:
postNavigationHooks: [
    async (crawlingContext) => {
        const { page } = crawlingContext;
        if (hasCaptcha(page)) {
            await solveCaptcha(page);
        }
    },
]
requestList
Type: RequestList
Static list of URLs to be processed. Either the requestList or the requestQueue option must be provided (or both).
requestQueue
Type: RequestQueue
Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. Either the requestList or the requestQueue option must be provided (or both).
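The two request sources could be opened along the following lines. This is a sketch: the `Apify` module is injected as a parameter, and the helper name `openSources`, the list name 'start-urls', and the start URL are placeholders chosen for the example.

```javascript
// `Apify` is injected; in a real actor it would come from require('apify').
async function openSources(Apify) {
    // Static list of start URLs.
    const requestList = await Apify.openRequestList('start-urls', [
        { url: 'https://example.com/' },
    ]);
    // Dynamic queue that handlers can add newly discovered URLs to.
    const requestQueue = await Apify.openRequestQueue();
    // Either one (or both) can be passed to the crawler options.
    return { requestList, requestQueue };
}
```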
maxRequestRetries
Type: number
= 3
Indicates how many times the request is retried if
PuppeteerCrawlerOptions.handlePageFunction
fails.
maxRequestsPerCrawl
Type: number
Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. Always set this value in order to prevent infinite loops in misconfigured crawlers. Note that in cases of parallel crawling, the actual number of pages visited might be slightly higher than this value.
autoscaledPoolOptions
Type: AutoscaledPoolOptions
Custom options passed to the underlying AutoscaledPool constructor. Note that the runTaskFunction and isTaskReadyFunction options are provided by the crawler and cannot be overridden. However, you can provide a custom implementation of isFinishedFunction.
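A custom isFinishedFunction might be supplied roughly as follows. The `state` object and its `externalWorkPending` flag are hypothetical and stand in for whatever condition your program tracks. How a custom function interacts with the crawler's own completion check is not covered here.

```javascript
// Hypothetical flag, flipped by some other part of the program.
const state = { externalWorkPending: true };

const autoscaledPoolOptions = {
    // Resolves to true only once the pool is allowed to finish.
    isFinishedFunction: async () => !state.externalWorkPending,
};
```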
minConcurrency
Type: number
= 1
Sets the minimum concurrency (parallelism) for the crawl. Shortcut to the corresponding AutoscaledPool
option.
WARNING: If you set this value too high with respect to the available system memory and CPU, your crawler will run extremely slowly or crash. If you're not sure, just keep the default value and the concurrency will scale up automatically.
maxConcurrency
Type: number
= 1000
Sets the maximum concurrency (parallelism) for the crawl. Shortcut to the corresponding AutoscaledPool
option.
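For instance, a conservative concurrency setup could be expressed as the options fragment below. The value 10 is an arbitrary illustration, not a recommendation.

```javascript
// Options fragment, meant to be merged into the PuppeteerCrawler options.
const concurrencyOptions = {
    minConcurrency: 1,  // the default; autoscaling raises concurrency as resources allow
    maxConcurrency: 10, // illustrative cap, far below the default of 1000
};
```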
useSessionPool
Type: boolean
= true
The Puppeteer crawler will initialize the SessionPool with the corresponding sessionPoolOptions. The session instance will then be available in the handlePageFunction.
sessionPoolOptions
Type: SessionPoolOptions
The configuration options for SessionPool
to use.
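The session options and a handler that uses the session might be combined as in this sketch. The pool size, the handler name `handlePageWithSession`, and the 'Access denied' title check are assumptions made for the example; real blocking signals depend on the target site.

```javascript
// Options fragment, meant to be merged into the PuppeteerCrawler options.
const sessionOptions = {
    useSessionPool: true,
    persistCookiesPerSession: true,
    sessionPoolOptions: { maxPoolSize: 50 }, // illustrative size
};

// Handler sketch that reports session health back to the pool.
async function handlePageWithSession({ page, session }) {
    const title = await page.title();
    if (title === 'Access denied') { // hypothetical blocking signal
        session.retire();            // stop using this session
        throw new Error('Request blocked, session retired');
    }
    session.markGood();
}
```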