utils
A namespace that contains various utilities.
Example usage:
const Apify = require('apify');
...
// Sleep 1.5 seconds
await Apify.utils.sleep(1500);
utils.URL_NO_COMMAS_REGEX
Default regular expression to match URLs in a string that may be plain text, JSON, CSV, or another format. It supports common URL characters and does not support URLs containing commas or spaces. The URLs also may contain Unicode letters (not symbols).
utils.URL_WITH_COMMAS_REGEX
Regular expression that, in addition to the default regular expression URL_NO_COMMAS_REGEX, supports matching commas in URL path and query. Note, however, that this may prevent parsing URLs from comma-delimited lists, or the URLs may become malformed.
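For illustration, a minimal sketch contrasting the two expressions (the sample string is made up, and the exact matches may vary with the regex internals):
const Apify = require('apify');
const text = 'See https://example.com/a,b and https://example.com/plain for details';
// The default expression stops at the comma:
console.log(text.match(Apify.utils.URL_NO_COMMAS_REGEX));
// e.g. [ 'https://example.com/a', 'https://example.com/plain' ]
// The comma-aware expression keeps the comma in the path:
console.log(text.match(Apify.utils.URL_WITH_COMMAS_REGEX));
// e.g. [ 'https://example.com/a,b', 'https://example.com/plain' ]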
utils.enqueueLinks(options)
The function finds elements matching a specific CSS selector (HTML anchor <a> by default) either in a Puppeteer page or in a Cheerio object (parsed HTML), and enqueues the URLs in their href attributes to the provided RequestQueue. If you're looking to find URLs in JavaScript-heavy pages where links are not available in href attributes but navigations are instead triggered in click handlers, see puppeteer.enqueueLinksByClickingElements().
Optionally, the function allows you to filter the target links' URLs using an array of PseudoUrl objects and to override settings of the enqueued Request objects.
Example usage:
await Apify.utils.enqueueLinks({
page,
requestQueue,
selector: 'a.product-detail',
pseudoUrls: ['https://www.example.com/handbags/[.*]', 'https://www.example.com/purses/[.*]'],
});
Parameters:
- options: object - All enqueueLinks() parameters are passed via an options object with the following keys:
  - [page]: PuppeteerPage | PlaywrightPage - Puppeteer or Playwright Page object. Either the page or the $ option must be provided.
  - [limit]: Number - Limit the count of actually enqueued URLs to this number. Useful for testing across the entire crawling scope.
  - [$]: cheerio.Root | cheerio.Selector - Cheerio function with loaded HTML. Either the page or the $ option must be provided.
  - requestQueue: RequestQueue - A request queue to which the URLs will be enqueued.
  - [selector]: string = 'a' - A CSS selector matching links to be enqueued.
  - [baseUrl]: string - A base URL that will be used to resolve relative URLs when using Cheerio. Ignored when using Puppeteer, since relative URL resolution is done inside the browser automatically.
  - [pseudoUrls]: Array<Object<string, *>> | Array<string> - An array of PseudoUrls matching the URLs to be enqueued, or an array of strings, RegExps or plain objects from which the PseudoUrls can be constructed. The plain objects must include at least the purl property, which holds the pseudo-URL string or RegExp. All remaining keys will be used as the requestTemplate argument of the PseudoUrl constructor, which lets you specify special properties for the enqueued Request objects. If pseudoUrls is an empty array, null or undefined, the function enqueues all links found on the page.
  - [transformRequestFunction]: RequestTransform - Just before a new Request is constructed and enqueued to the RequestQueue, this function can be used to remove it or modify its contents such as userData, payload or, most importantly, uniqueKey. This is useful when you need to enqueue multiple Requests that share the same URL but differ in method or payload, or to dynamically update or create userData. For example, by adding keepUrlFragment: true to the request object, URL fragments will not be removed when uniqueKey is computed. Example:
{
    transformRequestFunction: request => {
        request.userData.foo = 'bar';
        request.keepUrlFragment = true;
        return request;
    }
}
Returns:
Promise<Array<QueueOperationInfo>> - Promise that resolves to an array of QueueOperationInfo objects.
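When crawling with Cheerio instead of Puppeteer, you pass the $ function and a baseUrl. A minimal sketch (the HTML snippet and base URL are illustrative):
const Apify = require('apify');
const cheerio = require('cheerio');
...
const requestQueue = await Apify.openRequestQueue();
const $ = cheerio.load('<a href="/handbags/tote">Tote</a>');
await Apify.utils.enqueueLinks({
    $,
    requestQueue,
    // Required with Cheerio, because relative hrefs cannot be resolved by a browser here.
    baseUrl: 'https://www.example.com',
});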
utils.requestAsBrowser(options)
IMPORTANT: This function uses an insecure version of the HTTP parser by default and also ignores SSL/TLS errors. This is very useful in scraping, because it allows bypassing certain anti-scraping walls, but it also exposes you to some vulnerabilities. For scenarios other than scraping, please set useInsecureHttpParser: false and ignoreSslErrors: false.
Sends an HTTP request that looks like a request sent by a web browser, fully emulating a browser's HTTP headers. It uses HTTP/2 by default on Node.js 12+.
This function is useful for web scraping of websites that send the full HTML in the first response. Thanks to this function, the target web server has no simple way to find out that the request wasn't sent by a real web browser. Using a headless browser for such requests is an order of magnitude more resource-intensive than this function.
The function emulates the Chrome and Firefox web browsers. If you want more control over the browsers and their versions, use the headerGeneratorOptions property. You can find more info in the readme of the header-generator library.
Internally, the function uses the got-scraping library to perform the request. All options not recognized by this function are passed down to it, so see its documentation for more details.
Example usage:
const Apify = require('apify');
const { utils: { requestAsBrowser } } = Apify;
...
const response = await requestAsBrowser({ url: 'https://www.example.com/' });
const html = response.body;
const status = response.statusCode;
const contentType = response.headers['content-type'];
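If you need to pin the emulated browser, a minimal sketch using headerGeneratorOptions (the option shape follows the header-generator readme; treat the exact field names and values here as assumptions):
const response = await requestAsBrowser({
    url: 'https://www.example.com/',
    headerGeneratorOptions: {
        // Emulate only desktop Firefox 80+ on Linux.
        browsers: [{ name: 'firefox', minVersion: 80 }],
        devices: ['desktop'],
        operatingSystems: ['linux'],
    },
});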
Parameters:
- options: RequestAsBrowserOptions - All requestAsBrowser configuration options.
Returns:
Promise<RequestAsBrowserResult> - The result can be various objects, but it will always be like a Node.js HTTP response stream with a 'body' property for the parsed response body, unless the 'stream' option is used.
utils.isDocker(forceReset)
Returns a Promise that resolves to true if the code is running in a Docker container.
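A minimal usage sketch:
const Apify = require('apify');
...
if (await Apify.utils.isDocker()) {
    console.log('Running inside a Docker container.');
}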
Parameters:
- forceReset: boolean
Returns:
Promise<boolean>
utils.sleep(millis)
Returns a Promise that resolves after a specific period of time. This is useful to implement waiting in your code, e.g. to prevent overloading of the target website or to avoid bot detection.
Example usage:
const Apify = require('apify');
...
// Sleep 1.5 seconds
await Apify.utils.sleep(1500);
Parameters:
- millis: number - Period of time to sleep, in milliseconds. If not a positive number, the returned promise resolves immediately.
Returns:
Promise<void>
utils.downloadListOfUrls(options)
Returns a promise that resolves to an array of URLs parsed from the resource available at the provided URL. Optionally, a custom regular expression and encoding may be provided.
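A minimal usage sketch (the file URL is illustrative):
const Apify = require('apify');
...
const urls = await Apify.utils.downloadListOfUrls({
    url: 'https://www.example.com/url-list.txt',
});
console.log(`Downloaded ${urls.length} URLs.`);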
Parameters:
- options: object
  - url: string - URL to the file.
  - [encoding]: string = 'utf8' - The encoding of the file.
  - [urlRegExp]: RegExp = URL_NO_COMMAS_REGEX - Custom regular expression to identify the URLs in the file to extract. The regular expression should be case-insensitive and have the global flag set (i.e. /something/gi).
Returns:
Promise<Array<string>>
utils.extractUrls(options)
Collects all URLs in an arbitrary string to an array, optionally using a custom regular expression.
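A minimal usage sketch (the input string is made up):
const Apify = require('apify');
const urls = Apify.utils.extractUrls({
    string: 'Read https://www.example.com and https://apify.com before scraping.',
});
// e.g. [ 'https://www.example.com', 'https://apify.com' ]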
Parameters:
- options: object
  - string: string
  - [urlRegExp]: RegExp = Apify.utils.URL_NO_COMMAS_REGEX
Returns:
Array<string>
utils.htmlToText(html)
The function converts an HTML document to plain text.
The plain text generated by the function is similar to the text captured by pressing Ctrl+A and Ctrl+C on a page loaded in a web browser. The function doesn't aspire to preserve the formatting or to be perfectly correct with respect to HTML specifications. However, it attempts to generate newlines and whitespace in and around HTML elements to avoid merging distinct parts of text, and thus enable extraction of data from the text (e.g. phone numbers).
Example usage:
const text = htmlToText('<html><body>Some text</body></html>');
console.log(text);
Note that the function uses cheerio to parse the HTML. Optionally, to avoid duplicate parsing of HTML and thus improve performance, you can pass an existing Cheerio object to the function instead of the HTML text. The HTML should be parsed with the decodeEntities option set to true. For example:
const cheerio = require('cheerio');
const html = '<html><body>Some text</body></html>';
const text = htmlToText(cheerio.load(html, { decodeEntities: true }));
Parameters:
- html: string | cheerio.Root - HTML text or parsed HTML represented using a cheerio function.
Returns:
string - Plain text