RequestQueue
Represents a queue of URLs to crawl, which is used for deep crawling of websites where you start with several URLs and then recursively follow links to other pages. The data structure supports both breadth-first and depth-first crawling orders.
Each URL is represented using an instance of the Request class. The queue can only contain unique URLs. More precisely, it can only contain Request instances with distinct uniqueKey properties. By default, uniqueKey is generated from the URL, but it can also be overridden. To add a single URL multiple times to the queue, corresponding Request objects will need to have different uniqueKey properties.
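The uniqueness rule can be illustrated with a small in-memory model. This is only a sketch, not the SDK's implementation: addToModel is a hypothetical helper, the wasAlreadyPresent field is assumed to mirror what QueueOperationInfo reports, and defaulting uniqueKey to the raw URL is a simplification (the SDK may normalize the URL first).

```javascript
// Simplified in-memory model of the uniqueKey rule (not the SDK implementation).
// Assumption: uniqueKey defaults to the URL itself; the real SDK may normalize it.
const model = new Map();

// Hypothetical helper: stores the request unless its uniqueKey is already known.
function addToModel(requestLike) {
    const uniqueKey = requestLike.uniqueKey || requestLike.url;
    const wasAlreadyPresent = model.has(uniqueKey);
    if (!wasAlreadyPresent) model.set(uniqueKey, { ...requestLike, uniqueKey });
    return { uniqueKey, wasAlreadyPresent };
}

const first = addToModel({ url: 'http://example.com/aaa' });
// Same URL and no explicit uniqueKey: treated as a duplicate and not stored again.
const duplicate = addToModel({ url: 'http://example.com/aaa' });
// Same URL, but an explicit uniqueKey makes it a distinct entry.
const second = addToModel({ url: 'http://example.com/aaa', uniqueKey: 'aaa-second-visit' });

console.log(first.wasAlreadyPresent, duplicate.wasAlreadyPresent, second.wasAlreadyPresent);
console.log(model.size);
```

In this model the second call is rejected as a duplicate, while the third call is stored because its uniqueKey differs even though the URL is the same.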
Do not instantiate this class directly; use the Apify.openRequestQueue() function instead.
RequestQueue is used by BasicCrawler, CheerioCrawler, PuppeteerCrawler and PlaywrightCrawler as a source of URLs to crawl. Unlike RequestList, RequestQueue supports dynamic adding and removing of requests. On the other hand, the queue is not optimized for operations that add or remove a large number of URLs in a batch.
RequestQueue stores its data either on local disk or in the Apify Cloud, depending on whether the APIFY_LOCAL_STORAGE_DIR or APIFY_TOKEN environment variable is set.
If the APIFY_LOCAL_STORAGE_DIR environment variable is set, the queue data is stored in that directory in an SQLite database file.
If the APIFY_TOKEN environment variable is set but APIFY_LOCAL_STORAGE_DIR is not, the data is stored in the Apify Request Queue cloud storage. Note that you can also force use of the cloud storage by passing the forceCloud option to the Apify.openRequestQueue() function, even if the APIFY_LOCAL_STORAGE_DIR variable is set.
Example usage:
// Open the default request queue associated with the actor run
const queue = await Apify.openRequestQueue();
// Open a named request queue
const queueWithName = await Apify.openRequestQueue('some-name');
// Enqueue a few requests
await queue.addRequest({ url: 'http://example.com/aaa' });
await queue.addRequest({ url: 'http://example.com/bbb' });
await queue.addRequest({ url: 'http://example.com/foo/bar' }, { forefront: true });
requestQueue.addRequest(requestLike, [options])
Adds a request to the queue.
If a request with the same uniqueKey property is already present in the queue, it will not be updated. You can find out whether this happened from the resulting QueueOperationInfo object.
To add multiple requests to the queue by extracting links from a webpage, see the utils.enqueueLinks() helper function.
Parameters:
requestLike: Request | RequestOptions - Request object or vanilla object with request data. Note that the function sets the uniqueKey and id fields on the passed Request.
[options]: Object
  [forefront]: boolean = false - If true, the request will be added to the foremost position in the queue.
Returns:
Promise<QueueOperationInfo>
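The effect of the forefront option on ordering can be sketched with a plain array. OrderModel is a hypothetical stand-in, not the SDK class, and it ignores uniqueKey handling and storage entirely. It reuses the three URLs from the example usage above.

```javascript
// Minimal ordering model (an illustrative sketch, not the SDK implementation).
class OrderModel {
    constructor() {
        this.pending = [];
    }

    // Mirrors the shape of addRequest(requestLike, [options]) described above.
    addRequest(requestLike, { forefront = false } = {}) {
        if (forefront) this.pending.unshift(requestLike); // goes to the front
        else this.pending.push(requestLike); // goes to the back (default)
    }

    urls() {
        return this.pending.map((request) => request.url);
    }
}

const orderModel = new OrderModel();
orderModel.addRequest({ url: 'http://example.com/aaa' });
orderModel.addRequest({ url: 'http://example.com/bbb' });
orderModel.addRequest({ url: 'http://example.com/foo/bar' }, { forefront: true });

const order = orderModel.urls();
console.log(order);
```

In this model, appending newly discovered links keeps a first-in, first-out order, which resembles breadth-first crawling, while adding them with forefront set to true processes the newest links first, which resembles depth-first crawling.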
requestQueue.getRequest(id)
Gets the request with the specified ID from the queue.
Parameters:
id: string - ID of the request.
Returns:
Promise<(Request|null)> - Returns the request object, or null if it was not found.
requestQueue.fetchNextRequest()
Returns the next request in the queue to be processed, or null if there are no more pending requests.
Once you successfully finish processing the request, you need to call RequestQueue.markRequestHandled() to mark the request as handled in the queue. If there was an error in processing the request, call RequestQueue.reclaimRequest() instead, so that the queue will give the request to some other consumer in another call to the fetchNextRequest function.
Note that the null return value doesn't mean the queue processing finished; it means there are currently no pending requests. To check whether all requests in the queue were finished, use RequestQueue.isFinished() instead.
Returns:
Promise<(Request|null)> - Returns the request object or null if there are no more pending requests.
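The fetch, handle, and reclaim cycle described above can be sketched as a small consumer loop. The loop only calls methods named in this section, but several details are assumptions made for illustration: processQueue, handleRequest and maxRetries are hypothetical names, the retryCount field is used here as an assumed place to count retries, and InMemoryQueue is a stand-in so the sketch runs without the SDK or any storage.

```javascript
// Sketch of a consumer loop built from the methods described in this section.
// `handleRequest` and `maxRetries` are hypothetical; `retryCount` is an assumed
// field used here to cap the number of retries.
async function processQueue(queue, handleRequest, maxRetries = 3) {
    while (!(await queue.isFinished())) {
        const request = await queue.fetchNextRequest();
        if (!request) {
            // null means "nothing pending right now", not "queue finished".
            await new Promise((resolve) => setTimeout(resolve, 10));
            continue;
        }
        try {
            await handleRequest(request);
            await queue.markRequestHandled(request);
        } catch (err) {
            request.retryCount = (request.retryCount || 0) + 1;
            if (request.retryCount > maxRetries) {
                // Give up on this request so that the loop can terminate.
                await queue.markRequestHandled(request);
            } else {
                // Put the request back so a later fetchNextRequest() returns it.
                await queue.reclaimRequest(request);
            }
        }
    }
}

// Tiny in-memory stand-in (not the SDK class) with the same method names.
class InMemoryQueue {
    constructor(urls) {
        this.pending = urls.map((url) => ({ url }));
        this.inProgress = new Set();
        this.handled = [];
    }

    async fetchNextRequest() {
        const request = this.pending.shift() || null;
        if (request) this.inProgress.add(request);
        return request;
    }

    async markRequestHandled(request) {
        this.inProgress.delete(request);
        this.handled.push(request);
    }

    async reclaimRequest(request, { forefront = false } = {}) {
        this.inProgress.delete(request);
        if (forefront) this.pending.unshift(request);
        else this.pending.push(request);
    }

    async isFinished() {
        return this.pending.length === 0 && this.inProgress.size === 0;
    }
}

const demoQueue = new InMemoryQueue(['http://example.com/aaa', 'http://example.com/bbb']);
let failedOnce = false;
const done = processQueue(demoQueue, async (request) => {
    // Fail the first attempt at /aaa to exercise reclaimRequest().
    if (request.url.endsWith('/aaa') && !failedOnce) {
        failedOnce = true;
        throw new Error('simulated failure');
    }
}).then(() => demoQueue.handled.map((request) => request.url));

done.then((urls) => console.log(urls));
```

In this run the first attempt at /aaa fails and is reclaimed to the end of the stand-in queue, so /bbb is handled first and /aaa is handled on its second attempt. The crawler classes listed earlier consume the queue themselves, so a hand-written loop like this is mainly useful for understanding the contract.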
requestQueue.markRequestHandled(request)
Marks a request that was previously returned by the RequestQueue.fetchNextRequest() function as handled after successful processing. Handled requests will never again be returned by the fetchNextRequest function.
Parameters:
request: Request
Returns:
Promise<QueueOperationInfo>
requestQueue.reclaimRequest(request, [options])
Reclaims a failed request back to the queue, so that it can be returned for processing again later by another call to RequestQueue.fetchNextRequest(). The request record in the queue is updated using the provided request parameter. For example, this lets you store the number of retries or error messages for the request.
Parameters:
request: Request
[options]: object
  [forefront]: boolean = false - If true, the request is placed at the beginning of the queue, so that it's returned in the next call to RequestQueue.fetchNextRequest(). By default, it's put at the end of the queue.
Returns:
Promise<QueueOperationInfo>
requestQueue.isEmpty()
Resolves to true if the next call to RequestQueue.fetchNextRequest() would return null, otherwise it resolves to false. Note that even if the queue is empty, there might be some pending requests currently being processed. If you need to ensure that there is no activity in the queue, use RequestQueue.isFinished().
Returns:
Promise<boolean>
requestQueue.isFinished()
Resolves to true if all requests were already handled and there are no more left. Due to the nature of distributed storage used by the queue, the function might occasionally return a false negative, but it will never return a false positive.
Returns:
Promise<boolean>
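The difference between isEmpty() and isFinished() can be shown with a stand-in that tracks pending and in-progress requests separately. TwoStateQueue and demo are hypothetical names, and the model ignores the distributed-storage effects mentioned above, so it never produces a false negative.

```javascript
// In-memory stand-in (not the SDK class) with separate pending and in-progress sets.
class TwoStateQueue {
    constructor(urls) {
        this.pending = urls.map((url) => ({ url }));
        this.inProgress = new Set();
    }

    async fetchNextRequest() {
        const request = this.pending.shift() || null;
        if (request) this.inProgress.add(request);
        return request;
    }

    async markRequestHandled(request) {
        this.inProgress.delete(request);
    }

    // True when fetchNextRequest() would return null.
    async isEmpty() {
        return this.pending.length === 0;
    }

    // True only when nothing is pending and nothing is being processed.
    async isFinished() {
        return this.pending.length === 0 && this.inProgress.size === 0;
    }
}

async function demo() {
    const queue = new TwoStateQueue(['http://example.com/aaa']);
    const request = await queue.fetchNextRequest();

    // The only request is being processed: nothing left to fetch, but not finished.
    const whileProcessing = {
        isEmpty: await queue.isEmpty(),
        isFinished: await queue.isFinished(),
    };

    await queue.markRequestHandled(request);
    const afterHandled = {
        isEmpty: await queue.isEmpty(),
        isFinished: await queue.isFinished(),
    };

    return { whileProcessing, afterHandled };
}

const states = demo();
states.then((result) => console.log(result));
```

While the single request is in progress, the stand-in reports isEmpty as true and isFinished as false; both become true only after the request is marked as handled.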
requestQueue.drop()
Removes the queue either from the Apify Cloud storage or from the local database, depending on the mode of operation.
Returns:
Promise<void>
requestQueue.handledCount()
Returns the number of handled requests.
This function is just a convenient shortcut for:
const { handledRequestCount } = await queue.getInfo();
Returns:
Promise<number>
requestQueue.getInfo()
Returns an object containing general information about the request queue.
The function returns the same object as the Apify API Client's getQueue function, which in turn calls the Get request queue API endpoint.
Example:
{
id: "WkzbQMuFYuamGv3YF",
name: "my-queue",
userId: "wRsJZtadYvn4mBZmm",
createdAt: new Date("2015-12-12T07:34:14.202Z"),
modifiedAt: new Date("2015-12-13T08:36:13.202Z"),
accessedAt: new Date("2015-12-14T08:36:13.202Z"),
totalRequestCount: 25,
handledRequestCount: 5,
pendingRequestCount: 20,
}
Returns:
Promise<object>
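As a rough illustration of how the counts in the info object might be used, the sketch below derives a progress summary from an object shaped like the example above. summarizeQueueInfo is a hypothetical helper, not part of the SDK, and it only reads the three count fields shown in the example.

```javascript
// Hypothetical helper operating on an object shaped like the getInfo() example.
function summarizeQueueInfo(info) {
    const { totalRequestCount, handledRequestCount, pendingRequestCount } = info;
    // Treat a queue with no requests as fully handled to avoid dividing by zero.
    const handledRatio = totalRequestCount === 0 ? 1 : handledRequestCount / totalRequestCount;
    return {
        handledRatio,
        pendingRequestCount,
        nothingPending: pendingRequestCount === 0,
    };
}

// Counts copied from the example object above.
const summary = summarizeQueueInfo({
    totalRequestCount: 25,
    handledRequestCount: 5,
    pendingRequestCount: 20,
});

console.log(summary);
```

With a real queue, the argument would be the value resolved by await queue.getInfo().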