RequestQueue
Hierarchy
- BaseStorage
- RequestManager
- RequestQueue
Methods
__init__
Parameters
optional keyword-only id: str
optional keyword-only name: str | None
optional keyword-only storage_client: BaseStorageClient
Returns None
add_request
Add a single request to the manager and store it in the underlying resource client.
Parameters
optional keyword-only request: str | Request
The request object (or its string representation) to be added to the manager.
optional keyword-only forefront: bool = False
Determines whether the request should be added to the beginning (if True) or the end (if False) of the manager.
Returns ProcessedRequest
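The effect of forefront can be pictured with a plain double-ended queue. The following is a toy model only, not Crawlee code: the `enqueue` helper and the example URLs are hypothetical, and the real queue stores Request objects in a storage client rather than strings in memory.

```python
from collections import deque


def enqueue(pending: deque, url: str, *, forefront: bool = False) -> None:
    """Toy model: put `url` at the front (forefront=True) or at the end of `pending`."""
    if forefront:
        pending.appendleft(url)
    else:
        pending.append(url)


pending = deque()
enqueue(pending, "https://example.com/1")
enqueue(pending, "https://example.com/2")
enqueue(pending, "https://example.com/urgent", forefront=True)

# Requests would be handed out from the left side of the deque.
order = list(pending)
```

In this model, adding every newly discovered link with forefront=True makes the most recent links come out first, which tends toward depth-first order, while the default keeps first-in, first-out order, which tends toward breadth-first.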
add_requests_batched
Add requests to the manager in batches.
Parameters
optional keyword-only requests: Sequence[str | Request]
Requests to enqueue.
optional keyword-only batch_size: int = 1000
The number of requests to add in one batch.
optional keyword-only wait_time_between_batches: timedelta = timedelta(seconds=1)
Time to wait between adding batches.
optional keyword-only wait_for_all_requests_to_be_added: bool = False
If True, wait for all requests to be added before returning.
optional keyword-only wait_for_all_requests_to_be_added_timeout: timedelta | None = None
Timeout for waiting for all requests to be added.
Timeout for waiting for all requests to be added.
Returns None
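How batch_size and wait_time_between_batches interact can be sketched as follows. This is an illustration of the general batching idea under stated assumptions, not the library's internal implementation: `add_in_batches` and the `add_batch` callback are hypothetical names, and requests are plain strings here.

```python
import asyncio
from datetime import timedelta
from typing import Awaitable, Callable, Sequence


async def add_in_batches(
    requests: Sequence[str],
    add_batch: Callable[[Sequence[str]], Awaitable[None]],
    *,
    batch_size: int = 1000,
    wait_time_between_batches: timedelta = timedelta(seconds=1),
) -> int:
    """Send `requests` to `add_batch` in slices of `batch_size`; return the batch count."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batches = 0
    for start in range(0, len(requests), batch_size):
        if batches:
            # Pause between batches, but not before the first one.
            await asyncio.sleep(wait_time_between_batches.total_seconds())
        await add_batch(requests[start:start + batch_size])
        batches += 1
    return batches


async def demo():
    received = []

    async def add_batch(batch):
        # Hypothetical sink that only records what it was given.
        received.append(list(batch))

    urls = [f"https://example.com/{i}" for i in range(5)]
    count = await add_in_batches(
        urls, add_batch, batch_size=2, wait_time_between_batches=timedelta(0)
    )
    return count, received


count, received = asyncio.run(demo())
```

Five URLs with a batch size of 2 are split into batches of 2, 2 and 1. A smaller batch size or a longer wait spreads the load on the storage client at the cost of slower enqueueing.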
drop
Drop the storage, removing it from the underlying storage client and clearing the cache.
Returns None
fetch_next_request
Return the next request in the queue to be processed.
Once you have successfully finished processing the request, call RequestQueue.mark_request_as_handled to mark it as handled in the queue. If processing failed, call RequestQueue.reclaim_request instead, so that the queue hands the request to another consumer in a later call to the fetch_next_request method.
Note that a None return value does not mean that queue processing has finished; it means there are currently no pending requests. To check whether all requests in the queue have been handled, use RequestQueue.is_finished instead.
Returns Request | None
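The fetch, handle, and reclaim contract described above can be sketched as a consumer loop. The loop only relies on the four method names documented on this page. `InMemoryQueue` is a hypothetical stand-in so that the sketch runs without any storage client; it is not Crawlee code, and it uses plain URL strings where the real queue returns Request objects.

```python
import asyncio
from collections import deque


class InMemoryQueue:
    """Hypothetical stand-in exposing the four methods used by `consume`."""

    def __init__(self, items):
        self._pending = deque(items)
        self._in_progress = set()

    async def fetch_next_request(self):
        if not self._pending:
            return None  # nothing pending right now
        request = self._pending.popleft()
        self._in_progress.add(request)
        return request

    async def mark_request_as_handled(self, request):
        self._in_progress.discard(request)

    async def reclaim_request(self, request, *, forefront=False):
        self._in_progress.discard(request)
        if forefront:
            self._pending.appendleft(request)
        else:
            self._pending.append(request)

    async def is_finished(self):
        return not self._pending and not self._in_progress


async def consume(queue, handler, idle_sleep=0.0):
    """Process requests until the queue reports that it is finished."""
    while not await queue.is_finished():
        request = await queue.fetch_next_request()
        if request is None:
            # No pending request at the moment, which is not the same as finished.
            await asyncio.sleep(idle_sleep)
            continue
        try:
            await handler(request)
        except Exception:
            # Processing failed: hand the request back so it can be fetched again.
            await queue.reclaim_request(request)
        else:
            await queue.mark_request_as_handled(request)


async def demo():
    seen = []
    failed_once = set()

    async def handler(url):
        seen.append(url)
        if url.endswith("/flaky") and url not in failed_once:
            failed_once.add(url)
            raise RuntimeError("simulated transient failure")

    queue = InMemoryQueue(["https://example.com/a", "https://example.com/flaky"])
    await consume(queue, handler)
    return seen


seen = asyncio.run(demo())
```

The flaky URL is reclaimed after its first failure and fetched a second time. A production loop would normally cap the number of retries per request; that is left out here to keep the sketch short.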
get_handled_count
Return the number of handled requests.
Returns int
get_info
Get an object containing general information about the request queue.
Returns RequestQueueMetadata | None
get_request
Retrieve a request from the queue.
Parameters
optional keyword-only request_id: str
ID of the request to retrieve.
Returns Request | None
get_total_count
Return an offline approximation of the total number of requests in the source (i.e. pending + handled).
Returns int
is_empty
Check whether the queue is empty.
Returns bool
is_finished
Check whether the queue is finished.
Due to the nature of distributed storage used by the queue, the function might occasionally return a false negative, but it will never return a false positive.
Returns bool
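Because a True result can be trusted while a False result may be a false negative, a caller that only needs to know when the queue is done can simply check again later. A minimal polling sketch, assuming a queue object with an awaitable is_finished method; `wait_until_finished` is a hypothetical helper, and the queue is replaced by a mock from the standard library so the example runs on its own:

```python
import asyncio
from unittest.mock import AsyncMock


async def wait_until_finished(queue, poll_interval: float = 0.5) -> int:
    """Poll `queue.is_finished()` until it returns True; return the number of polls."""
    polls = 0
    while True:
        polls += 1
        if await queue.is_finished():
            return polls
        # A False answer may be a false negative, so wait and ask again.
        await asyncio.sleep(poll_interval)


async def demo():
    # Mock queue that answers False twice and then True.
    queue = AsyncMock()
    queue.is_finished.side_effect = [False, False, True]
    return await wait_until_finished(queue, poll_interval=0)


polls = asyncio.run(demo())
```

Wrapping the call in asyncio.wait_for would add an overall timeout, which is advisable when no consumer may be running.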
mark_request_as_handled
Mark a request as handled after successful processing.
Handled requests will never again be returned by the RequestQueue.fetch_next_request method.
Parameters
optional keyword-only request: Request
The request to mark as handled.
Returns ProcessedRequest | None
open
Open a storage, either restoring an existing one or creating a new one.
Parameters
optional keyword-only id: str | None = None
The storage ID.
optional keyword-only name: str | None = None
The storage name.
optional keyword-only configuration: Configuration | None = None
Configuration object used during the storage creation or restoration process.
optional keyword-only storage_client: BaseStorageClient | None = None
Underlying storage client to use. If not provided, the default global storage client from the service locator will be used.
Returns BaseStorage
reclaim_request
Reclaim a failed request back to the queue.
The request will be returned for processing again later by another call to RequestQueue.fetch_next_request.
Parameters
optional keyword-only request: Request
The request to return to the queue.
optional keyword-only forefront: bool = False
Whether to add the request to the head or the end of the queue.
Returns ProcessedRequest | None
Properties
id
Get the storage ID.
name
Get the storage name.
Represents a queue storage for managing HTTP requests in web crawling operations.
The RequestQueue class handles a queue of HTTP requests, each identified by a unique URL, to facilitate structured web crawling. It supports both breadth-first and depth-first crawling strategies, allowing for recursive crawling starting from an initial set of URLs. Each URL in the queue is uniquely identified by a unique_key, which can be customized to allow the same URL to be added multiple times under different keys.
Data can be stored either locally or in the cloud, depending on the setup of the underlying storage client. By default a MemoryStorageClient is used, but it can be changed to a different one.
By default, data is stored under a path built from the following components:
- {CRAWLEE_STORAGE_DIR}: The root directory for all storage data, specified by the environment variable.
- {QUEUE_ID}: The identifier for the request queue, either "default" or as specified.
- {REQUEST_ID}: The unique identifier for each request in the queue.
The RequestQueue supports both creating new queues and opening existing ones by id or name. Named queues persist indefinitely, while unnamed queues expire after 7 days unless specified otherwise. The queue supports mutable operations, allowing URLs to be added and removed as needed.
Usage