CheerioCrawlerOptions
Properties
handlePageFunction
Type: CheerioHandlePage
User-provided function that performs the logic of the crawler. It is called for each page loaded and parsed by the crawler.
The function receives the following object as an argument:
{
    // The Cheerio function loaded with the parsed HTML.
    $: Cheerio,
    // The response body of the web page, whose type depends on the content type.
    body: String|Buffer,
    // The object parsed from JSON for responses with the "application/json" content type.
    // For other content types it's null.
    json: Object,
    // Apify.Request object with details of the requested web page.
    request: Request,
    // Parsed Content-Type HTTP header: { type, encoding }
    contentType: Object,
    // An instance of Node's http.IncomingMessage object.
    response: Object,
    // Session object, useful to work around anti-scraping protections.
    session: Session,
    // ProxyInfo object with information about the currently used proxy.
    proxyInfo: ProxyInfo,
    // The running cheerio crawler instance.
    crawler: CheerioCrawler,
}
The type of body depends on the Content-Type header of the web page:
- String for text/html, application/xhtml+xml, and application/xml MIME content types
- Buffer for other MIME content types
The Content-Type header, parsed using the content-type package, is stored in contentType.
Cheerio is available only for HTML and XML content types.
The request property holds the Request object representing the URL to crawl.
If the function returns a promise, the crawler awaits it.
If the function throws an exception, the crawler will try to re-crawl the request later, up to option.maxRequestRetries times. If all the retries fail, the crawler calls the function provided to the handleFailedRequestFunction parameter. To make this work, you should always let your function throw exceptions rather than catch them. The exceptions are logged to the request using the Request.pushErrorMessage() function.
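As a minimal sketch of the contract described above, the handler below reads the page title through $ and throws when the page looks unusable, leaving the retry logic to the crawler. The helper name extractPageData, the collectedPages array, and the shape of the returned record are illustrative assumptions, not part of the SDK.

```javascript
// Hypothetical helper: pulls a few fields out of the parsed page.
// `$` is the Cheerio function and `request` is the Request object
// that the crawler passes to handlePageFunction.
function extractPageData({ $, request }) {
    const title = $('title').text().trim();
    if (!title) {
        // Throw instead of swallowing the problem, so the crawler can
        // retry the request (up to maxRequestRetries times).
        throw new Error(`No <title> found on ${request.url}`);
    }
    return { url: request.url, title };
}

// Illustrative in-memory store for the extracted records.
const collectedPages = [];

const handlerCrawlerOptions = {
    // ...
    handlePageFunction: async (context) => {
        // No try/catch here: errors propagate to the crawler.
        collectedPages.push(extractPageData(context));
    },
};
```

Keeping the extraction in a plain synchronous helper makes it easy to exercise with a stubbed $ before running a real crawl.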
requestList
Type: RequestList
Static list of URLs to be processed. Either the requestList or the requestQueue option must be provided (or both).
requestQueue
Type: RequestQueue
Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. Either the requestList or the requestQueue option must be provided (or both).
prepareRequestFunction
Type: PrepareRequest
This option is deprecated, use preNavigationHooks instead.
A function that executes before the HTTP request is made to the target resource. This function is suitable for setting dynamic properties, such as cookies, on the Request.
The function receives the following object as an argument:
{
    request: Request,
    session: Session,
    proxyInfo: ProxyInfo,
    crawler: CheerioCrawler,
}
where the Request instance corresponds to the initialized request and the Session instance corresponds to the session in use.
The function should modify the properties of the passed Request instance in place, because other parts of the SDK already hold references to it. Making a copy and returning it from this function is therefore not supported: it would create inconsistencies where different parts of the SDK would have access to different Request instances.
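The in-place requirement can be sketched as follows. The cookie value is a made-up placeholder, and the example assumes the Request instance exposes a headers object.

```javascript
// Sketch: add a Cookie header by changing the passed Request itself.
// 'locale=en' is a placeholder value used only for illustration.
const prepareRequestFunction = async ({ request }) => {
    // Assign to a property of the existing instance; do not clone the request.
    request.headers = { ...request.headers, Cookie: 'locale=en' };
    // Nothing is returned: the crawler keeps using the same Request instance.
};
```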
postResponseFunction
Type: PostResponse
This option is deprecated, use postNavigationHooks instead.
A function that executes right after the HTTP request to the target resource is made and the response is returned. This function is suitable for overriding properties of the response, e.g. setting headers that affect how the response is parsed.
Example usage:
const cheerioCrawlerOptions = {
    // ...
    postResponseFunction: ({ request, response }) => {
        if (request.userData.parseAsJSON) {
            response.headers['content-type'] = 'application/json; charset=utf-8';
        }
    },
};
The function receives the following object as an argument:
{
    response: Object,
    request: Request,
    session: Session,
    proxyInfo: ProxyInfo,
    crawler: CheerioCrawler,
}
The response is an instance of Node's http.IncomingMessage object.
handlePageTimeoutSecs
Type: number = 60
Timeout in which the function passed as handlePageFunction needs to finish, given in seconds.
requestTimeoutSecs
Type: number = 30
Timeout in which the HTTP request to the resource needs to finish, given in seconds.
ignoreSslErrors
Type: boolean = true
If set to true, SSL certificate errors will be ignored.
proxyConfiguration
Type: ProxyConfiguration
If set, CheerioCrawler will be configured to route all connections through Apify Proxy or your own proxy URLs, provided and rotated according to the configuration. For more information, see the documentation.
handleFailedRequestFunction
Type: HandleFailedRequest
A function to handle requests that failed more than option.maxRequestRetries times. The function receives the following object as an argument:
{
    error: Error,
    request: Request,
    session: Session,
    $: Cheerio,
    body: String|Buffer,
    json: Object,
    contentType: Object,
    response: Object,
    proxyInfo: ProxyInfo,
    crawler: CheerioCrawler,
}
where the Request instance corresponds to the failed request, and the Error instance represents the last error thrown during processing of the request.
See source code for the default implementation of this function.
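A minimal sketch of a custom handler follows. It records each permanently failed request in an in-memory array. The failedRequests array is an illustrative assumption, as is the request.errorMessages property, which is taken here to hold the messages logged through Request.pushErrorMessage().

```javascript
// Illustrative in-memory log of requests that exhausted their retries.
const failedRequests = [];

const handleFailedRequestFunction = async ({ request, error }) => {
    failedRequests.push({
        url: request.url,
        // `error` is the last error thrown while processing the request.
        lastError: error.message,
        // Assumption: earlier errors are readable from request.errorMessages.
        errorMessages: request.errorMessages || [],
    });
};
```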
preNavigationHooks
Type: Array<Hook>
Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies or browser properties before navigation.
Each function accepts two parameters, crawlingContext and requestAsBrowserOptions. The latter is passed to the requestAsBrowser() function that the crawler calls to navigate. Example:
preNavigationHooks: [
    async (crawlingContext, requestAsBrowserOptions) => {
        requestAsBrowserOptions.forceUrlEncoding = true;
    },
]
postNavigationHooks
Type: Array<Hook>
Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful. Each function accepts crawlingContext as the only parameter. Example:
postNavigationHooks: [
    async (crawlingContext) => {
        // ...
    },
]
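One way to check whether the navigation succeeded is to inspect the response status code, as sketched below. This assumes that crawlingContext carries the same request and response properties that handlePageFunction receives, and that an error thrown from a hook is treated like an error thrown from handlePageFunction. The helper name assertSuccessfulNavigation is hypothetical.

```javascript
// Hypothetical helper: rejects responses with a 5xx status code.
function assertSuccessfulNavigation({ request, response }) {
    // `response` is assumed to be Node's http.IncomingMessage.
    if (response && response.statusCode >= 500) {
        throw new Error(`Server error ${response.statusCode} for ${request.url}`);
    }
}

const hookCrawlerOptions = {
    // ...
    postNavigationHooks: [
        async (crawlingContext) => assertSuccessfulNavigation(crawlingContext),
    ],
};
```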
additionalMimeTypes
Type: Array<string>
An array of MIME types you want the crawler to load and process. By default, only the text/html and application/xhtml+xml MIME types are supported.
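The sketch below enables two extra MIME types and shows how a handler might branch on the parsed content type. The helper name summarizeResponse and its return shape are illustrative assumptions. It relies on the handlePageFunction contract: json is populated for application/json responses, and body is a Buffer for non-HTML, non-XML content types.

```javascript
const mimeCrawlerOptions = {
    // ...
    additionalMimeTypes: ['application/json', 'text/plain'],
};

// Hypothetical helper, meant to be called from handlePageFunction.
function summarizeResponse({ contentType, json, body }) {
    if (contentType.type === 'application/json') {
        // The crawler has already parsed the JSON payload.
        return { kind: 'json', keys: Object.keys(json) };
    }
    // Node's Buffer supports only a limited set of encodings,
    // so fall back to UTF-8 when the declared one is not supported.
    const encoding = Buffer.isEncoding(contentType.encoding) ? contentType.encoding : 'utf-8';
    const text = Buffer.isBuffer(body) ? body.toString(encoding) : body;
    return { kind: 'text', length: text.length };
}
```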
suggestResponseEncoding
Type: string
By default, CheerioCrawler will extract the correct encoding from the HTTP response headers. Sadly, some websites send invalid headers. Responses from those sites are decoded as UTF-8, so if a site actually uses a different encoding, the response will be corrupted. You can use suggestResponseEncoding to fall back to a certain encoding if you know that your target website uses it. To force a certain encoding, disregarding the response headers, use CheerioCrawlerOptions.forceResponseEncoding.
// Will fall back to windows-1250 encoding if none found
suggestResponseEncoding: 'windows-1250'
forceResponseEncoding
Type: string
By default, CheerioCrawler will extract the correct encoding from the HTTP response headers. Use forceResponseEncoding to force a certain encoding, disregarding the response headers. To only provide a default for missing encodings, use CheerioCrawlerOptions.suggestResponseEncoding.
// Will force windows-1250 encoding even if headers say otherwise
forceResponseEncoding: 'windows-1250'
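The relationship between the two encoding options can be summarized by the following sketch. It only mirrors the precedence described in this document (forced encoding, then the encoding from the headers, then the suggested encoding, then UTF-8) and is not the crawler's actual implementation. The function name resolveEncoding and the headerEncoding parameter are hypothetical.

```javascript
// Illustrative only: mirrors the documented precedence, not the SDK internals.
function resolveEncoding({ headerEncoding, suggestResponseEncoding, forceResponseEncoding }) {
    // 1. A forced encoding wins regardless of the response headers.
    if (forceResponseEncoding) return forceResponseEncoding;
    // 2. Otherwise use the encoding found in the response headers.
    if (headerEncoding) return headerEncoding;
    // 3. Otherwise fall back to the suggestion, and finally to UTF-8.
    return suggestResponseEncoding || 'utf-8';
}
```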
maxRequestRetries
Type: number = 3
Indicates how many times the request is retried if either requestFunction or handlePageFunction fails.
maxRequestsPerCrawl
Type: number
Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. Always set this value in order to prevent infinite loops in misconfigured crawlers. Note that in cases of parallel crawling, the actual number of pages visited might be slightly higher than this value.
autoscaledPoolOptions
Type: AutoscaledPoolOptions
Custom options passed to the underlying AutoscaledPool constructor. Note that the runTaskFunction, isTaskReadyFunction and isFinishedFunction options are provided by CheerioCrawler and cannot be overridden. Reasonable Snapshotter and SystemStatus defaults are provided to account for the fact that cheerio parses HTML synchronously and therefore blocks the event loop.
minConcurrency
Type: number = 1
Sets the minimum concurrency (parallelism) for the crawl. Shortcut to the corresponding AutoscaledPool option.
WARNING: If you set this value too high with respect to the available system memory and CPU, your crawler will run extremely slowly or crash. If you're not sure, just keep the default value and the concurrency will scale up automatically.
maxConcurrency
Type: number = 1000
Sets the maximum concurrency (parallelism) for the crawl. Shortcut to the corresponding AutoscaledPool option.
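A conservative configuration of the limits described above might look like the fragment below. The values for maxRequestsPerCrawl and maxConcurrency are arbitrary illustrations, not recommendations. minConcurrency is left at its default so that the pool can scale up on its own.

```javascript
const limitedCrawlerOptions = {
    // ...
    // Hard cap on opened pages, to guard against infinite crawl loops.
    maxRequestsPerCrawl: 500,
    // Default value: let the autoscaled pool raise concurrency itself.
    minConcurrency: 1,
    // Lowered from the default of 1000 to bound memory and CPU usage.
    maxConcurrency: 50,
};
```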
useSessionPool
Type: boolean = true
If set to true, the crawler will automatically use Session Pool. It will automatically retire sessions on 403, 401 and 429 status codes. It also marks the Session as bad after a request timeout.
sessionPoolOptions
Type: SessionPoolOptions
Custom options passed to the underlying SessionPool constructor.
persistCookiesPerSession
Type: boolean
Automatically saves cookies to the Session. Works only if Session Pool is used.
It parses cookies from the response "set-cookie" header and saves or updates them for the session. Once the session is used for the next request, it passes the "Cookie" header with the session cookies to the request.
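The round trip described above can be illustrated with a deliberately simplified sketch. It is not the SDK's implementation: it keeps cookies in a plain Map standing in for the session and ignores attributes such as Path, Domain and Expires. The function names storeCookies and buildCookieHeader are hypothetical.

```javascript
// Save or update cookies from "set-cookie" response header values.
function storeCookies(sessionCookies, setCookieHeaders) {
    for (const header of setCookieHeaders) {
        // Only the leading "name=value" pair is kept; attributes are ignored.
        const pair = header.split(';')[0];
        const index = pair.indexOf('=');
        if (index === -1) continue;
        sessionCookies.set(pair.slice(0, index).trim(), pair.slice(index + 1).trim());
    }
    return sessionCookies;
}

// Build the "Cookie" request header for the next request of the session.
function buildCookieHeader(sessionCookies) {
    return [...sessionCookies].map(([name, value]) => `${name}=${value}`).join('; ');
}
```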