Dataset
Hierarchy
- BaseStorage
- Dataset
Index
Methods
__init__
Parameters
optional keyword-only id: str
optional keyword-only name: str | None
optional keyword-only storage_client: BaseStorageClient
Returns None
check_and_serialize
Serializes a given item to JSON, checks its serializability and size against a limit.
Parameters
optional keyword-only item: JsonSerializable
The item to serialize.
optional keyword-only index: int | None = None
Index of the item, used for error context.
Returns str
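The serialize-then-check behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: the function name check_and_serialize_sketch, the limit parameter, and the exact byte value of the limit (9 * 1024 * 1024, based on the 9MB figure quoted for push_data) are assumptions.

```python
import json

# Assumed limit, derived from the 9MB figure in the push_data description.
MAX_PAYLOAD_SIZE_BYTES = 9 * 1024 * 1024


def check_and_serialize_sketch(item, index=None, limit=MAX_PAYLOAD_SIZE_BYTES):
    """Serialize an item to JSON; reject it if it is not serializable or too large."""
    # The index is only used to make error messages more helpful.
    suffix = f" at index {index}" if index is not None else ""

    try:
        payload = json.dumps(item, ensure_ascii=False)
    except (TypeError, ValueError) as exc:
        raise ValueError(f"Data item{suffix} is not JSON serializable.") from exc

    # Measure the encoded size, since the limit applies to bytes, not characters.
    size = len(payload.encode("utf-8"))
    if size > limit:
        raise ValueError(
            f"Data item{suffix} is too large ({size} bytes, limit {limit} bytes)."
        )
    return payload
```

Calling check_and_serialize_sketch({"a": 1}) returns the JSON string, while an item containing a set, or one exceeding the limit, raises ValueError with the index included in the message when one is given.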
drop
Drop the storage, removing it from the underlying storage client and clearing the cache.
Returns None
export_to
Exports the entire dataset into a specified file stored under a key in a key-value store.
This method consolidates all entries from a specified dataset into one file, which is then saved under a given key in a key-value store. The format of the exported file is determined by the content_type parameter. Either the dataset's ID or name should be specified, and similarly, either the target key-value store's ID or name should be used.
Parameters
Returns None
get_data
Retrieves dataset items based on filtering, sorting, and pagination parameters.
This method allows customization of the data retrieval process from a dataset, supporting operations such as field selection, ordering, and skipping specific records based on provided parameters.
Parameters
Returns DatasetItemsListPage
get_info
Get an object containing general information about the dataset.
Returns DatasetMetadata | None
iterate_items
Iterates over dataset items, applying filtering, sorting, and pagination.
Retrieves dataset items incrementally, allowing fine-grained control over the data fetched. The function supports various parameters to filter, sort, and limit the data returned, facilitating tailored dataset queries.
Parameters
optional keyword-only offset: int = 0
Initial number of items to skip.
optional keyword-only limit: int | None = None
Max number of items to return. No limit if None.
optional keyword-only clean: bool = False
Filters out empty items and hidden fields if True.
optional keyword-only desc: bool = False
Returns items in reverse order if True.
optional keyword-only fields: list[str] | None = None
Specific fields to include in each item.
optional keyword-only omit: list[str] | None = None
Fields to omit from each item.
optional keyword-only unwind: str | None = None
Field name to unwind items by.
optional keyword-only skip_empty: bool = False
Omits empty items if True.
optional keyword-only skip_hidden: bool = False
Excludes fields starting with '#' if True.
Returns AsyncIterator[dict]
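To make the parameter semantics above concrete, here is a small synchronous sketch that applies them to an in-memory list of dicts. It is an illustration under assumptions, not the library's code: the name iterate_items_sketch is hypothetical, the order in which the filters are applied (field filtering first, then offset and limit) is assumed, and unwind is left out.

```python
from typing import Any, Iterator, Optional


def iterate_items_sketch(
    items: list[dict[str, Any]],
    *,
    offset: int = 0,
    limit: Optional[int] = None,
    clean: bool = False,
    desc: bool = False,
    fields: Optional[list[str]] = None,
    omit: Optional[list[str]] = None,
    skip_empty: bool = False,
    skip_hidden: bool = False,
) -> Iterator[dict[str, Any]]:
    # Per the description, clean filters out both empty items and hidden fields.
    if clean:
        skip_empty = skip_hidden = True

    ordered = reversed(items) if desc else iter(items)
    skipped = 0
    produced = 0

    for item in ordered:
        row = dict(item)
        if skip_hidden:
            # Hidden fields are those whose name starts with '#'.
            row = {k: v for k, v in row.items() if not k.startswith("#")}
        if fields is not None:
            row = {k: row[k] for k in fields if k in row}
        if omit is not None:
            row = {k: v for k, v in row.items() if k not in omit}
        if skip_empty and not row:
            continue
        # Assumed: offset and limit count only the items that survive filtering.
        if skipped < offset:
            skipped += 1
            continue
        if limit is not None and produced >= limit:
            break
        produced += 1
        yield row
```

For example, with clean=True, offset=1 and limit=1, the first non-empty item is skipped and only the second one is yielded, with any '#'-prefixed fields removed.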
open
Open a storage, either restoring an existing one or creating a new one.
Parameters
optional keyword-only id: str | None = None
The storage ID.
optional keyword-only name: str | None = None
The storage name.
optional keyword-only configuration: Configuration | None = None
Configuration object used during the storage creation or restoration process.
optional keyword-only storage_client: BaseStorageClient | None = None
Underlying storage client to use. If not provided, the default global storage client from the service locator will be used.
Returns BaseStorage
push_data
Store an object or an array of objects to the dataset.
The size of the data is limited by the receiving API and therefore push_data() will only allow objects whose JSON representation is smaller than 9MB. When an array is passed, none of the included objects may be larger than 9MB, but the array itself may be of any size.
Parameters
optional keyword-only data: JsonSerializable
A JSON serializable data structure to be stored in the dataset. The JSON representation of each item must be smaller than 9MB.
Returns None
write_to_csv
Exports the entire dataset as CSV into an arbitrary stream.
Parameters
optional keyword-only destination: TextIO
The stream into which the dataset contents should be written.
Returns None
write_to_json
Exports the entire dataset as JSON into an arbitrary stream.
Parameters
optional keyword-only destination: TextIO
The stream into which the dataset contents should be written.
Returns None
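As a rough picture of what writing to an arbitrary TextIO stream involves, the sketch below writes a list of dict items to any text stream using only the standard library. The helper names write_items_to_csv and write_items_to_json are hypothetical, and the column-ordering rule is an assumption; the actual write_to_csv and write_to_json methods read the items from the dataset itself and may format the output differently.

```python
import csv
import io
import json
from typing import Any, TextIO


def write_items_to_csv(items: list[dict[str, Any]], destination: TextIO) -> None:
    """Write items as CSV rows, with a header built from all keys seen."""
    if not items:
        return
    # Assumed column order: keys in order of first appearance across items.
    fieldnames = list(dict.fromkeys(key for item in items for key in item))
    writer = csv.DictWriter(destination, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(items)


def write_items_to_json(items: list[dict[str, Any]], destination: TextIO) -> None:
    """Write all items as a single JSON array."""
    json.dump(items, destination, ensure_ascii=False)


# Any text stream works as a destination: an open file, sys.stdout, or a buffer.
buffer = io.StringIO()
write_items_to_csv([{"id": 1, "title": "a"}, {"id": 2, "title": "b"}], buffer)
```

Because the destination is just a text stream, the same call can target a file opened in text mode or an in-memory buffer without changes.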
Properties
id
Get the storage ID.
name
Get the storage name.
Represents an append-only structured storage, ideal for tabular data similar to database tables.
The Dataset class is designed to store structured data, where each entry (row) maintains consistent attributes (columns) across the dataset. It operates in an append-only mode, allowing new records to be added, but not modified or deleted. This makes it particularly useful for storing results from web crawling operations.
Data can be stored either locally or in the cloud, depending on the setup of the underlying storage client. By default a MemoryStorageClient is used, but it can be changed to a different one.
By default, data is stored using a path structure with the following components:
- {CRAWLEE_STORAGE_DIR}: The root directory for all storage data specified by the environment variable.
- {DATASET_ID}: Specifies the dataset, either "default" or a custom dataset ID.
- {INDEX}: Represents the zero-based index of the record within the dataset.
To open a dataset, use the open class method by specifying an id, name, or configuration. If none are provided, the default dataset for the current crawler run is used. Attempting to open a dataset by id that does not exist will raise an error; however, if accessed by name, the dataset will be created if it doesn't already exist.
Usage
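A minimal usage sketch, assuming Crawlee for Python is installed and that Dataset is importable from crawlee.storages. The dataset name "my_dataset" and the example items are placeholders, and the calls follow the method descriptions on this page rather than a verified run.

```python
import asyncio


async def main() -> None:
    # Imported inside the function so this sketch can be loaded
    # even where Crawlee is not installed (assumed import path).
    from crawlee.storages import Dataset

    # Opening by name creates the dataset if it does not exist yet.
    dataset = await Dataset.open(name="my_dataset")

    # Append a single item, then a list of items (the dataset is append-only).
    await dataset.push_data(data={"url": "https://example.com", "title": "Example"})
    await dataset.push_data(
        data=[{"url": "https://example.com/a"}, {"url": "https://example.com/b"}]
    )

    # Read the items back incrementally, skipping empty ones.
    async for item in dataset.iterate_items(limit=10, skip_empty=True):
        print(item)


# To run the sketch: asyncio.run(main())
```

Opening by name is used here because, as noted above, opening by an id that does not exist raises an error.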