Version: 0.2

Dataset

The Dataset class represents a store for structured data where each object stored has the same attributes.

You can think of it as a table, where each object is a row and its attributes are columns. Dataset is append-only storage: you can add new records to it, but you cannot modify or remove existing records. It is typically used to store crawling results.

Do not instantiate this class directly; use the Actor.open_dataset() function instead.

Dataset stores its data either on local disk or in the Apify cloud, depending on whether the APIFY_LOCAL_STORAGE_DIR or APIFY_TOKEN environment variables are set.

If the APIFY_LOCAL_STORAGE_DIR environment variable is set, the data is stored in the local directory in the following files:

{APIFY_LOCAL_STORAGE_DIR}/datasets/{DATASET_ID}/{INDEX}.json

Here, {DATASET_ID} is the name or ID of the dataset. The default dataset has the ID default, unless you override it by setting the APIFY_DEFAULT_DATASET_ID environment variable. Each dataset item is stored as a separate JSON file, where {INDEX} is the zero-based index of the item in the dataset.

If the APIFY_TOKEN environment variable is set but APIFY_LOCAL_STORAGE_DIR is not, the data is stored in the Apify Dataset cloud storage.
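The storage rules above can be sketched in a few lines of Python. This is only an illustration of the documented rules, not the SDK's own code: the helper names describe_storage and local_item_path are hypothetical, and the file name uses the plain index because the exact file name formatting (for example, any zero-padding) is not specified here.

```python
from pathlib import Path
from typing import Mapping, Optional


def describe_storage(env: Mapping[str, str]) -> str:
    """Return 'local' or 'cloud' following the rules described above."""
    if env.get('APIFY_LOCAL_STORAGE_DIR'):
        # Local storage wins whenever the directory variable is set.
        return 'local'
    if env.get('APIFY_TOKEN'):
        return 'cloud'
    # The text does not say what happens when neither variable is set.
    return 'unspecified'


def local_item_path(env: Mapping[str, str], index: int,
                    dataset_id: Optional[str] = None) -> Path:
    """Build {APIFY_LOCAL_STORAGE_DIR}/datasets/{DATASET_ID}/{INDEX}.json."""
    if dataset_id is None:
        # Fall back to the default dataset ID, which can be overridden.
        dataset_id = env.get('APIFY_DEFAULT_DATASET_ID', 'default')
    storage_dir = Path(env['APIFY_LOCAL_STORAGE_DIR'])
    return storage_dir / 'datasets' / dataset_id / f'{index}.json'
```

Passing os.environ as env applies the same rules to the current process environment.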

Methods

drop

async drop(): None

Remove the dataset either from the Apify cloud storage or from the local directory.
Returns None

export_to

async export_to(key, *, to_key_value_store_id, to_key_value_store_name, content_type): None

Save the entirety of the dataset's contents into one file within a key-value store.
Parameters
- key: str
  The key to save the data under.
- to_key_value_store_id: Optional[str] = None (optional, keyword-only)
  The ID of the key-value store in which the result will be saved.
- to_key_value_store_name: Optional[str] = None (optional, keyword-only)
  The name of the key-value store in which the result will be saved. Specify at most one of the to_key_value_store_id and to_key_value_store_name arguments. If you omit both, the default key-value store is used.
- content_type: Optional[str] = None (optional, keyword-only)
  Either 'text/csv' or 'application/json'. Defaults to JSON.
Returns None
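A minimal usage sketch for export_to follows. It assumes that dataset is an object obtained from Actor.open_dataset(); the helper name export_both_formats and the key names are hypothetical and chosen only for illustration.

```python
async def export_both_formats(dataset, base_key: str) -> None:
    """Export the whole dataset twice: once as JSON and once as CSV."""
    # content_type is omitted here, so the documented default (JSON) applies.
    await dataset.export_to(f'{base_key}.json')
    # Request CSV explicitly through the documented content_type value.
    await dataset.export_to(f'{base_key}.csv', content_type='text/csv')
```

Because neither to_key_value_store_id nor to_key_value_store_name is passed, both files would land in the default key-value store.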

export_to_csv

async export_to_csv(key, *, from_dataset_id, from_dataset_name, to_key_value_store_id, to_key_value_store_name): None

Save the entirety of the dataset's contents into one CSV file within a key-value store.
Parameters
- key: str
  The key to save the data under.
- from_dataset_id: Optional[str] = None (optional, keyword-only)
  The ID of the dataset to export when calling this as a class method.
- from_dataset_name: Optional[str] = None (optional, keyword-only)
  The name of the dataset to export when calling this as a class method. Specify at most one of the from_dataset_id and from_dataset_name arguments. If you omit both, the default dataset is used.
- to_key_value_store_id: Optional[str] = None (optional, keyword-only)
  The ID of the key-value store in which the result will be saved.
- to_key_value_store_name: Optional[str] = None (optional, keyword-only)
  The name of the key-value store in which the result will be saved. Specify at most one of the to_key_value_store_id and to_key_value_store_name arguments. If you omit both, the default key-value store is used.
Returns None

export_to_json

async export_to_json(key, *, from_dataset_id, from_dataset_name, to_key_value_store_id, to_key_value_store_name): None

Save the entirety of the dataset's contents into one JSON file within a key-value store.
Parameters
- key: str
  The key to save the data under.
- from_dataset_id: Optional[str] = None (optional, keyword-only)
  The ID of the dataset to export when calling this as a class method.
- from_dataset_name: Optional[str] = None (optional, keyword-only)
  The name of the dataset to export when calling this as a class method. Specify at most one of the from_dataset_id and from_dataset_name arguments. If you omit both, the default dataset is used.
- to_key_value_store_id: Optional[str] = None (optional, keyword-only)
  The ID of the key-value store in which the result will be saved.
- to_key_value_store_name: Optional[str] = None (optional, keyword-only)
  The name of the key-value store in which the result will be saved. Specify at most one of the to_key_value_store_id and to_key_value_store_name arguments. If you omit both, the default key-value store is used.
Returns None

get_data

async get_data(*, offset, limit, clean, desc, fields, omit, unwind, skip_empty, skip_hidden, flatten, view): ListPage

Get items from the dataset.
Parameters
- offset: Optional[int] = None (optional, keyword-only)
  Number of items that should be skipped at the start. The default value is 0.
- limit: Optional[int] = None (optional, keyword-only)
  Maximum number of items to return. By default there is no limit.
- clean: Optional[bool] = None (optional, keyword-only)
  If True, returns only non-empty items and skips hidden fields (i.e. fields starting with the # character). The clean parameter is a shortcut for skip_hidden=True and skip_empty=True. Because some items might be skipped, the result might contain fewer items than the limit value.
- desc: Optional[bool] = None (optional, keyword-only)
  By default, results are returned in the same order as they were stored. To reverse the order, set this parameter to True.
- fields: Optional[List[str]] = None (optional, keyword-only)
  A list of fields to pick from the items; only these fields remain in the resulting record objects. The fields in the output items are sorted in the same order as they are specified in the fields parameter, so you can use this feature to fix the output format.
- omit: Optional[List[str]] = None (optional, keyword-only)
  A list of fields which should be omitted from the items.
- unwind: Optional[str] = None (optional, keyword-only)
  Name of a field which should be unwound. If the field is an array, every element of the array becomes a separate record and is merged with the parent object. If the unwound field is an object, it is merged with the parent object. If the unwound field is missing, or its value is neither an array nor an object and therefore cannot be merged with a parent object, the item is preserved as it is. Note that the unwound items ignore the desc parameter.
- skip_empty: Optional[bool] = None (optional, keyword-only)
  If True, empty items are skipped from the output. If used, the results might contain fewer items than the limit value.
- skip_hidden: Optional[bool] = None (optional, keyword-only)
  If True, hidden fields (i.e. fields starting with the # character) are skipped from the output.
- flatten: Optional[List[str]] = None (optional, keyword-only)
  A list of fields that should be flattened.
- view: Optional[str] = None (optional, keyword-only)
  Name of the dataset view to be used.
Returns ListPage
ListPage: A page of the list of dataset items according to the specified filters.
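The unwind rules are easier to follow with a small model. The sketch below mimics the documented behavior in plain Python; it is not the SDK's or the API's implementation, and the helper name unwind_items is hypothetical. Two details are assumptions because the description does not cover them: the unwound field is dropped from the parent before merging, and array elements that are not objects are kept under the original field name.

```python
from typing import Any, Dict, Iterable, List


def unwind_items(items: Iterable[Dict[str, Any]], field: str) -> List[Dict[str, Any]]:
    """Model the documented unwind behavior on a list of plain dicts."""
    result: List[Dict[str, Any]] = []
    for item in items:
        value = item.get(field)
        # Parent object without the unwound field (dropping it is an assumption).
        parent = {k: v for k, v in item.items() if k != field}
        if isinstance(value, list):
            # Every array element becomes its own record merged with the parent.
            for element in value:
                if isinstance(element, dict):
                    result.append({**parent, **element})
                else:
                    # Assumption: non-object elements stay under the field name.
                    result.append({**parent, field: element})
        elif isinstance(value, dict):
            # An object is merged with the parent object.
            result.append({**parent, **value})
        else:
            # Missing field or a scalar value: the item is preserved as it is.
            result.append(item)
    return result
```

For example, an item {'shop': 'A', 'offers': [{'price': 1}, {'price': 2}]} unwound on 'offers' yields two records, one per offer, each carrying the shop field.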

get_info

async get_info(): Optional[Dict]

Get an object containing general information about the dataset.
Returns Optional[Dict]
dict: Object returned by calling the GET dataset API endpoint.

iterate_items

iterate_items(*, offset, limit, clean, desc, fields, omit, unwind, skip_empty, skip_hidden): AsyncIterator[Dict]

Iterate over the items in the dataset.
Parameters
- offset: int = 0 (optional, keyword-only)
  Number of items that should be skipped at the start. The default value is 0.
- limit: Optional[int] = None (optional, keyword-only)
  Maximum number of items to return. By default there is no limit.
- clean: Optional[bool] = None (optional, keyword-only)
  If True, returns only non-empty items and skips hidden fields (i.e. fields starting with the # character). The clean parameter is a shortcut for skip_hidden=True and skip_empty=True. Because some items might be skipped, the result might contain fewer items than the limit value.
- desc: Optional[bool] = None (optional, keyword-only)
  By default, results are returned in the same order as they were stored. To reverse the order, set this parameter to True.
- fields: Optional[List[str]] = None (optional, keyword-only)
  A list of fields to pick from the items; only these fields remain in the resulting record objects. The fields in the output items are sorted in the same order as they are specified in the fields parameter, so you can use this feature to fix the output format.
- omit: Optional[List[str]] = None (optional, keyword-only)
  A list of fields which should be omitted from the items.
- unwind: Optional[str] = None (optional, keyword-only)
  Name of a field which should be unwound. If the field is an array, every element of the array becomes a separate record and is merged with the parent object. If the unwound field is an object, it is merged with the parent object. If the unwound field is missing, or its value is neither an array nor an object and therefore cannot be merged with a parent object, the item is preserved as it is. Note that the unwound items ignore the desc parameter.
- skip_empty: Optional[bool] = None (optional, keyword-only)
  If True, empty items are skipped from the output. If used, the results might contain fewer items than the limit value.
- skip_hidden: Optional[bool] = None (optional, keyword-only)
  If True, hidden fields (i.e. fields starting with the # character) are skipped from the output.
Returns AsyncIterator[Dict]
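Unlike the other methods, iterate_items is not awaited: it returns an async iterator that is consumed with async for. The sketch below shows that pattern. FakeDataset is a hypothetical in-memory stand-in that honors only offset and limit, included so the example runs without the SDK; with the real SDK, dataset would come from Actor.open_dataset(). The helper collect_urls and the url field are also illustrative assumptions.

```python
import asyncio
from typing import Any, AsyncIterator, Dict, List, Optional


class FakeDataset:
    """Hypothetical stand-in for a Dataset; supports only offset and limit."""

    def __init__(self, items: List[Dict[str, Any]]) -> None:
        self._items = items

    async def iterate_items(self, *, offset: int = 0, limit: Optional[int] = None,
                            **_ignored: Any) -> AsyncIterator[Dict[str, Any]]:
        end = None if limit is None else offset + limit
        for item in self._items[offset:end]:
            yield item


async def collect_urls(dataset: Any, limit: int) -> List[str]:
    """Gather the 'url' field of at most `limit` items."""
    urls: List[str] = []
    # No await on iterate_items(): it hands back an async iterator directly.
    async for item in dataset.iterate_items(limit=limit):
        urls.append(item['url'])
    return urls


if __name__ == '__main__':
    fake = FakeDataset([{'url': 'https://example.com/1'},
                        {'url': 'https://example.com/2'}])
    print(asyncio.run(collect_urls(fake, limit=1)))
```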

open

async open(*, id, name, force_cloud, config): Dataset

Open a dataset.

Datasets are used to store structured data where each object stored has the same attributes, such as online store products or real estate offers. The actual data is stored either on the local filesystem or in the Apify cloud.
Parameters
- id: Optional[str] = None (optional, keyword-only)
  ID of the dataset to be opened. If neither id nor name is provided, the method returns the default dataset associated with the actor run. If a dataset with the given ID does not exist, an error is raised.
- name: Optional[str] = None (optional, keyword-only)
  Name of the dataset to be opened. If neither id nor name is provided, the method returns the default dataset associated with the actor run. If a dataset with the given name does not exist, it is created.
- force_cloud: bool = False (optional, keyword-only)
  If set to True, opens a dataset on the Apify Platform even when running the actor locally. Defaults to False.
- config: Optional[Configuration] = None (optional, keyword-only)
  A Configuration instance. The global configuration is used if omitted.
Returns Dataset
Dataset: An instance of the Dataset class for the given ID or name.
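The difference between opening by ID and opening by name can be illustrated with a toy in-memory model. This is a sketch of the documented lookup rules only, not the SDK's implementation: the class ToyStore, the generated IDs, and the KeyError exception type are all hypothetical (the description says only that an error is raised). Passing both id and name is not covered by the description; rejecting it here is an assumption.

```python
from typing import Dict, Optional


class ToyStore:
    """Toy model of the id/name lookup rules described above."""

    def __init__(self) -> None:
        # Maps dataset ID -> dataset name (None for the unnamed default).
        self.datasets: Dict[str, Optional[str]] = {'default': None}

    def open(self, *, id: Optional[str] = None, name: Optional[str] = None) -> str:
        """Return the ID of the opened dataset."""
        if id is not None and name is not None:
            # Assumption: this case is not described, so the sketch rejects it.
            raise ValueError('Pass either id or name, not both')
        if id is None and name is None:
            # Neither given: the default dataset is returned.
            return 'default'
        if id is not None:
            # Opening by ID never creates a dataset.
            if id not in self.datasets:
                raise KeyError(f'Dataset with ID {id!r} does not exist')
            return id
        # Opening by name reuses an existing dataset or creates a new one.
        for existing_id, existing_name in self.datasets.items():
            if existing_name == name:
                return existing_id
        new_id = f'id-{len(self.datasets)}'
        self.datasets[new_id] = name
        return new_id
```

In this model, opening the same name twice returns the same dataset, while opening an unknown ID fails.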

push_data

async push_data(data): None

Store an object or an array of objects to the dataset.

The size of the data is limited by the receiving API, so push_data() only accepts objects whose JSON representation is smaller than 9 MB. When an array is passed, none of the included objects may be larger than 9 MB, but the array itself may be of any size.
Parameters
- data: JSONSerializable
  A dict or an array of dicts containing the data to be stored in the dataset. The JSON representation of each item must be smaller than 9 MB.
Returns None
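A pre-flight check for the per-item size limit might look like the sketch below. It is not part of the SDK: the helper name oversized_indexes is hypothetical, "9 MB" is read as 9 * 1024 * 1024 bytes, and json.dumps with default settings is assumed to approximate the JSON representation that the API measures.

```python
import json
from typing import Any, Dict, List, Union

# Assumption: "9 MB" is interpreted as 9 MiB.
MAX_ITEM_BYTES = 9 * 1024 * 1024


def oversized_indexes(data: Union[Dict[str, Any], List[Dict[str, Any]]]) -> List[int]:
    """Return the positions of items whose JSON form reaches the limit."""
    # A single object is treated like a one-element array.
    items = data if isinstance(data, list) else [data]
    too_big: List[int] = []
    for index, item in enumerate(items):
        size = len(json.dumps(item).encode('utf-8'))
        # Each item is checked on its own; the array as a whole has no limit.
        if size >= MAX_ITEM_BYTES:
            too_big.append(index)
    return too_big
```

An empty result means every item is individually below the limit, regardless of how many items the array holds.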
