
RequestQueue

Represents a queue storage for managing HTTP requests in web crawling operations.

The RequestQueue class manages a queue of HTTP requests to facilitate structured web crawling. It supports both breadth-first and depth-first crawling strategies, allowing recursive crawling starting from an initial set of URLs. Each URL in the queue is identified by a unique_key, which can be customized so that the same URL can be added multiple times under different keys.

Data can be stored either locally or in the cloud, depending on the configuration of the underlying storage client. By default, a MemoryStorageClient is used, but it can be swapped for a different implementation.

By default, data is stored using the following path structure:

{CRAWLEE_STORAGE_DIR}/request_queues/{QUEUE_ID}/{REQUEST_ID}.json
  • {CRAWLEE_STORAGE_DIR}: The root directory for all storage data specified by the environment variable.
  • {QUEUE_ID}: The identifier for the request queue, either "default" or as specified.
  • {REQUEST_ID}: The unique identifier for each request in the queue.
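
For example, you can point the storage at a custom directory before opening the queue; a minimal sketch, with a placeholder path:

import os

# Placeholder path; requests for the default queue then end up under
# ./my_storage/request_queues/default/.
os.environ['CRAWLEE_STORAGE_DIR'] = './my_storage'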

The RequestQueue supports both creating new queues and opening existing ones by id or name. Named queues persist indefinitely, while unnamed queues expire after 7 days unless specified otherwise. The queue supports mutable operations, allowing URLs to be added and removed as needed.

Usage

from crawlee.storages import RequestQueue

rq = await RequestQueue.open(name='my_rq')
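
A slightly fuller sketch, assuming Request.from_url accepts a unique_key argument, shows how the same URL can be enqueued twice under different keys, as described above:

from crawlee import Request
from crawlee.storages import RequestQueue

async def main() -> None:
    rq = await RequestQueue.open(name='my_rq')

    # Plain URL; the unique_key is derived from the URL itself.
    await rq.add_request(request='https://crawlee.dev')

    # The same URL again, under a custom unique_key (assumed keyword),
    # so the queue treats it as a separate request.
    await rq.add_request(
        request=Request.from_url('https://crawlee.dev', unique_key='crawlee-homepage-retry'),
    )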


Methods

__init__

  • __init__(*, id, name, configuration, client, event_manager): None
  • Parameters

    Returns None

add_request

  • async add_request(*, request, forefront): ProcessedRequest
  • Add a single request to the provider and store it in the underlying resource client.


    Parameters

    • request: str | Request (optional, keyword-only)

      The request object (or its string representation) to be added to the provider.

    • forefront: bool = False (optional, keyword-only)

      Determines whether the request should be added to the beginning (if True) or the end (if False) of the provider.

    Returns ProcessedRequest

add_requests_batched

  • async add_requests_batched(*, requests, batch_size, wait_time_between_batches, wait_for_all_requests_to_be_added, wait_for_all_requests_to_be_added_timeout): None
  • Add requests to the underlying resource client in batches.


    Parameters

    • requests: Sequence[str | Request] (optional, keyword-only)

      Requests to add to the queue.

    • batch_size: int = 1000 (optional, keyword-only)

      The number of requests to add in one batch.

    • wait_time_between_batches: timedelta = timedelta(seconds=1) (optional, keyword-only)

      Time to wait between adding batches.

    • wait_for_all_requests_to_be_added: bool = False (optional, keyword-only)

      If True, wait for all requests to be added before returning.

    • wait_for_all_requests_to_be_added_timeout: timedelta | None = None (optional, keyword-only)

      Timeout for waiting for all requests to be added.

    Returns None
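
For instance, a hedged sketch of seeding the queue with many URLs at once, using the parameters listed above (the URLs are placeholders):

from datetime import timedelta

from crawlee.storages import RequestQueue

async def enqueue_seeds() -> None:
    rq = await RequestQueue.open(name='my_rq')
    urls = [f'https://example.com/page/{i}' for i in range(5000)]

    # Add in batches of 1000 and block until everything is stored.
    await rq.add_requests_batched(
        requests=urls,
        batch_size=1000,
        wait_time_between_batches=timedelta(seconds=1),
        wait_for_all_requests_to_be_added=True,
    )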

drop

  • async drop(): None
  • Drop the storage. Remove it from the underlying storage and delete it from the cache.


    Returns None
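
A minimal sketch of disposing of a queue once it is no longer needed; here the default, unnamed queue is opened and then removed:

from crawlee.storages import RequestQueue

async def cleanup() -> None:
    rq = await RequestQueue.open()  # the default, unnamed queue

    # ... use the queue ...

    # Remove the queue from the underlying storage and the local cache.
    await rq.drop()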

fetch_next_request

  • async fetch_next_request(): Request | None
  • Return the next request in the queue to be processed.

    Once you successfully finish processing the request, call RequestQueue.mark_request_as_handled to mark it as handled in the queue. If an error occurs while processing the request, call RequestQueue.reclaim_request instead, so that the queue can hand the request to another consumer in a subsequent call to fetch_next_request.

    Note that a None return value does not mean that queue processing is finished; it only means there are currently no pending requests. To check whether all requests in the queue have been processed, use RequestQueue.is_finished instead.


    Returns Request | None
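
Putting fetch_next_request together with is_finished, mark_request_as_handled, and reclaim_request (all documented on this page), a consumer loop might look roughly like the sketch below; process_request is a hypothetical placeholder for your own handling logic:

import asyncio

from crawlee import Request
from crawlee.storages import RequestQueue

async def process_request(request: Request) -> None:
    ...  # hypothetical placeholder: fetch and parse the page here

async def consume(rq: RequestQueue) -> None:
    while not await rq.is_finished():
        request = await rq.fetch_next_request()
        if request is None:
            # No pending requests right now; this does not mean the queue is finished.
            await asyncio.sleep(1)
            continue
        try:
            await process_request(request)
        except Exception:
            # Processing failed: return the request so it can be retried later.
            await rq.reclaim_request(request=request)
        else:
            # Success: fetch_next_request will never return this request again.
            await rq.mark_request_as_handled(request=request)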

get_handled_count

  • async get_handled_count(): int
  • Returns the number of handled requests.


    Returns int

get_info

  • async get_info(): RequestQueueMetadata | None
  • Get an object containing general information about the request queue.


    Returns RequestQueueMetadata | None

get_request

  • async get_request(*, request_id): Request | None
  • Retrieve a request from the queue.


    Parameters

    • request_id: str (optional, keyword-only)

      ID of the request to retrieve.

    Returns Request | None
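
A small sketch of looking a request back up by its ID, assuming the ProcessedRequest returned by add_request exposes the ID assigned to the stored request:

from crawlee.storages import RequestQueue

async def lookup_example() -> None:
    rq = await RequestQueue.open(name='my_rq')
    processed = await rq.add_request(request='https://crawlee.dev')

    # Assumption: ProcessedRequest carries the stored request's ID.
    request = await rq.get_request(request_id=processed.id)
    if request is not None:
        print(request.url)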

get_total_count

  • async get_total_count(): int
  • Returns an offline approximation of the total number of requests in the queue (i.e. pending + handled).


    Returns int

is_empty

  • async is_empty(): bool
  • Check whether the queue is empty.


    Returns bool

is_finished

  • async is_finished(): bool
  • Check whether the queue is finished.

    Due to the nature of distributed storage used by the queue, the function might occasionally return a false negative, but it will never return a false positive.


    Returns bool

mark_request_as_handled

  • async mark_request_as_handled(*, request): ProcessedRequest | None
  • Mark a request as handled after successful processing.

    Handled requests will never again be returned by the RequestQueue.fetch_next_request method.


    Parameters

    • request: Request (optional, keyword-only)

      The request to mark as handled.

    Returns ProcessedRequest | None

open

  • async open(*, id, name, configuration): BaseStorage
  • Open a storage: restore an existing one or create a new one.


    Parameters

    • id: str | None = None (optional, keyword-only)

      The storage ID.

    • name: str | None = None (optional, keyword-only)

      The storage name.

    • configuration: Configuration | None = None (optional, keyword-only)

      The configuration to use.

    Returns BaseStorage
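
A minimal sketch of the two lookup modes, reusing the id property documented below:

from crawlee.storages import RequestQueue

async def open_queues() -> None:
    # Open (or create) a named queue; named queues persist indefinitely.
    named_rq = await RequestQueue.open(name='my_rq')

    # Re-open the same queue by its ID.
    same_rq = await RequestQueue.open(id=named_rq.id)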

reclaim_request

  • async reclaim_request(*, request, forefront): ProcessedRequest | None
  • Reclaim a failed request back to the queue.

    The request will be returned for processing again by a later call to RequestQueue.fetch_next_request.


    Parameters

    • request: Request (optional, keyword-only)

      The request to return to the queue.

    • forefront: bool = False (optional, keyword-only)

      Whether to add the request to the head or the end of the queue.

    Returns ProcessedRequest | None

Properties

id

id: str

Get the storage ID.

name

name: str | None

Get the storage name.