
Dataset

Store and export web scraping, crawling or data processing job results. Learn how to access and manage datasets in the Apify console or via API.

Dataset storage enables you to sequentially save and retrieve data. Each actor run is assigned its own dataset, which is created when the first item is stored to it.

Datasets usually contain results from web scraping, crawling or data processing jobs. The data can be visualized as a table where each object is a row and its attributes are the columns. The data can be exported in JSON, CSV, XML, RSS, Excel or HTML formats.
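To make the table view concrete, here is a minimal sketch of how a list of dataset items could be laid out as rows and columns. The toTable() helper is hypothetical and not part of any Apify library; it only illustrates the idea that each object becomes a row and the union of all attribute names becomes the columns.

```javascript
// Hypothetical helper (not an Apify API): lays out dataset items as a table.
function toTable(items) {
    // Collect column names in order of first appearance
    const columns = [];
    for (const item of items) {
        for (const key of Object.keys(item)) {
            if (!columns.includes(key)) columns.push(key);
        }
    }
    // One row per item; attributes an item lacks become empty cells
    const rows = items.map((item) => columns.map((key) => (key in item ? item[key] : '')));
    return { columns, rows };
}

const table = toTable([
    { url: 'https://example.com', title: 'Example page' },
    { url: 'https://example.org', price: 10 },
]);
console.log(table.columns); // [ 'url', 'title', 'price' ]
console.log(table.rows);
```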

Named datasets are retained indefinitely.
Unnamed datasets expire after 7 days unless otherwise specified.
Learn about named and unnamed datasets.

Dataset storage is append-only - data can only be added and cannot be changed or deleted.

Basic usage

There are five ways to access your datasets:

Apify console

In the Apify console, you can view your datasets in the Storage section under the Datasets tab.

Only named datasets are displayed by default. Select the Include unnamed datasets checkbox to display all of your datasets.

Datasets in app

To view or download a dataset in the above mentioned formats, click on its Dataset ID. In the detail page, you can update the dataset's name (and, in turn, its retention period) and access rights under the Settings tab. The API tab allows you to view and test the dataset's API endpoints.

Datasets detail view

Apify SDK

If you are building an Apify actor, you will be using the Apify SDK. In the Apify SDK, the dataset is represented by the Dataset class.

You can use the Dataset class to specify whether your data is stored locally or in the Apify cloud, push data to datasets of your choice using the pushData() method, and call methods such as getData(), map() and reduce() (see example).

If you have chosen to store your dataset locally, you can find it in the location below.

{APIFY_LOCAL_STORAGE_DIR}/datasets/{DATASET_ID}/{INDEX}.json

DATASET_ID refers to the dataset's name or ID. The default dataset will be stored in the default directory.

To add data to the default dataset, you can use the example below. Using the Apify.main() function is optional; it is only provided for your convenience.

// Import the Apify SDK into your project
const Apify = require('apify');

// The optional Apify.main() function performs the
// actor's job and terminates the process when it is finished
Apify.main(async () => {
    // Add one item to the default dataset
    await Apify.pushData({ foo: 'bar' });

    // Add multiple items to the default dataset
    await Apify.pushData([{ foo: 'hotel' }, { foo: 'cafe' }]);
});

Make sure to use the await keyword when calling pushData(). Otherwise, the actor process might finish before the data are stored.

If you want to use something other than the default dataset, e.g. a dataset that you share between actors or between actor runs, you can use the Apify.openDataset() method.

// Save a named dataset to a variable
const dataset = await Apify.openDataset('some-name');

// Add data to the named dataset
await dataset.pushData({ foo: 'bar' });

When using the getData() method, you can limit the data you retrieve using the [fields] parameter. It should be an array of field names (strings) that will be included in the results. To include all fields, omit the [fields] parameter.

// Only get the "hotel" and "cafe" fields
const hotelAndCafeData = await dataset.getData({
    fields: ['hotel', 'cafe'],
});

See the SDK documentation and the Dataset class's API reference for details on managing datasets with the Apify SDK.

JavaScript API client

Apify's JavaScript API client (apify-client) allows you to access your datasets from any Node.js application, whether it is running on the Apify platform or elsewhere.

After importing and initiating the client, you can save each dataset to a variable for easier access.

const myDatasetClient = apifyClient.dataset('jane-doe/my-dataset');

You can then use that variable to access the dataset's items and manage it.

Note: When using the .listItems() method, if you mention the same field name in both the fields and omit parameters, the omit parameter will prevail and the field will not be returned.

See the JavaScript API client documentation for help with setup and more details.

Python API client

Apify's Python API client (apify-client) allows you to access your datasets from any Python application, whether it is running on the Apify platform or elsewhere.

After importing and initiating the client, you can save each dataset to a variable for easier access.

my_dataset_client = apify_client.dataset('jane-doe/my-dataset')

You can then use that variable to access the dataset's items and manage it.

Note: When using the .list_items() method, if you mention the same field name in both the fields and omit parameters, the omit parameter will prevail and the field will not be returned.

See the Python API client documentation for help with setup and more details.

Apify API

The Apify API allows you to access your datasets programmatically using HTTP requests and easily share your crawling results.

If you are accessing your datasets using the username~dataset-name dataset ID format, you will need to use your secret API token. You can find the token (and your user ID) on the Integrations page of your Apify account.

When providing your API authentication token, we recommend using the request's Authorization header, rather than the URL. (More info).
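As a minimal sketch of the recommendation above, the snippet below builds a request that carries the token in the Authorization header instead of the URL. The buildDatasetRequest() helper, the 'API_TOKEN' placeholder and the use of the Bearer scheme are assumptions made for illustration; check the API documentation for the exact header format.

```javascript
// Hypothetical helper: builds a GET request for a dataset with the token
// passed in the Authorization header (Bearer scheme is an assumption here).
function buildDatasetRequest(datasetId, token) {
    return {
        url: `https://api.apify.com/v2/datasets/${encodeURIComponent(datasetId)}`,
        options: {
            method: 'GET',
            headers: { Authorization: `Bearer ${token}` },
        },
    };
}

// 'jane-doe~my-dataset' and 'API_TOKEN' are placeholders, not real values
const request = buildDatasetRequest('jane-doe~my-dataset', 'API_TOKEN');
console.log(request.url);

// Where a fetch implementation is available (e.g. Node.js 18+),
// the request could be sent from inside an async function like this:
// const response = await fetch(request.url, request.options);
```

Keeping the token out of the URL means it is less likely to end up in logs or browser history.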

To get a list of your datasets, send a GET request to the Get list of datasets endpoint.

https://api.apify.com/v2/datasets

To get information about a dataset such as its creation time and item count, send a GET request to the Get dataset endpoint.

https://api.apify.com/v2/datasets/{DATASET_ID}

To view a dataset's data, send a GET request to the Get dataset items Apify API endpoint.

https://api.apify.com/v2/datasets/{DATASET_ID}/items

You can specify which data are exported by adding a comma-separated list of fields to the fields query parameter. Likewise, you can also omit certain fields using the omit parameter.

If you both specify and omit the same field in a request, the omit parameter will prevail and the field will not be returned.
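The precedence rule above can be imitated locally. The following is a sketch only: selectFields() is a hypothetical helper, not part of the Apify API or clients, and the sample item is made up.

```javascript
// Hypothetical helper that mimics the documented fields/omit behavior
// on a single item held in memory.
function selectFields(item, { fields, omit = [] } = {}) {
    // Start from the requested fields, or from all fields when none are given
    const keys = fields ?? Object.keys(item);
    const result = {};
    for (const key of keys) {
        // The omit parameter prevails over the fields parameter
        if (omit.includes(key)) continue;
        if (key in item) result[key] = item[key];
    }
    return result;
}

const item = { hotel: 'Grand', cafe: 'Luna', city: 'Prague' };
console.log(selectFields(item, { fields: ['hotel', 'cafe'], omit: ['cafe'] }));
// { hotel: 'Grand' }
```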

In addition, you can set the format in which you retrieve the data using the ?format= parameter. The available formats are json, jsonl, csv, html, xlsx, xml and rss. The default value is json.

To retrieve the hotel and cafe fields, you would send your GET request to the URL below.

https://api.apify.com/v2/datasets/{DATASET_ID}/items?format=json&fields=hotel%2Ccafe

Commas in the field list must be written as %2C, which is the URL-encoded form of a comma.
Learn more about URL encoding here.
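You rarely need to encode the commas by hand. As a sketch, the standard URL and URLSearchParams classes available in Node.js and browsers perform the encoding for you. The buildItemsUrl() helper below is hypothetical, and DATASET_ID is a placeholder.

```javascript
// Hypothetical helper: builds the "Get dataset items" URL with query parameters.
function buildItemsUrl(datasetId, { format = 'json', fields = [] } = {}) {
    const url = new URL(`https://api.apify.com/v2/datasets/${datasetId}/items`);
    url.searchParams.set('format', format);
    // The comma-separated list is percent-encoded automatically (',' becomes '%2C')
    if (fields.length > 0) url.searchParams.set('fields', fields.join(','));
    return url.toString();
}

console.log(buildItemsUrl('DATASET_ID', { fields: ['hotel', 'cafe'] }));
// https://api.apify.com/v2/datasets/DATASET_ID/items?format=json&fields=hotel%2Ccafe
```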

To add data to a dataset, send a POST request to the Put items endpoint. The payload is a JSON object, or an array of JSON objects, containing the data you want to add.

https://api.apify.com/v2/datasets/{DATASET_ID}/items

Pushing data to a dataset via the API is limited to 200 requests per second to prevent our servers from being overloaded.

Example payload:

[
    {
        "foo": "bar"
    },
    {
        "foo": "hotel"
    },
    {
        "foo": "cafe"
    }
]
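The request described above could be assembled as follows. This is a sketch under stated assumptions: buildPushRequest() is a hypothetical helper, DATASET_ID and API_TOKEN are placeholders, and the Bearer scheme in the Authorization header is assumed rather than taken from this page.

```javascript
// Hypothetical helper: builds a POST request that adds items to a dataset.
function buildPushRequest(datasetId, items, token) {
    return {
        url: `https://api.apify.com/v2/datasets/${datasetId}/items`,
        options: {
            method: 'POST',
            headers: {
                'Content-Type': 'application/json',
                Authorization: `Bearer ${token}`, // assumed header format
            },
            // The payload is the JSON-serialized object or array of objects
            body: JSON.stringify(items),
        },
    };
}

const pushRequest = buildPushRequest(
    'DATASET_ID',
    [{ foo: 'bar' }, { foo: 'hotel' }, { foo: 'cafe' }],
    'API_TOKEN',
);
console.log(pushRequest.options.body);
// [{"foo":"bar"},{"foo":"hotel"},{"foo":"cafe"}]
```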

See the API documentation for a detailed breakdown of each API endpoint.

Hidden fields

Top-level fields starting with the # character are considered hidden. These fields may be easily omitted when downloading the data from a dataset by providing the skipHidden=1 or clean=1 query parameters. This provides a convenient way to store debug information that should not appear in the final dataset.

Below is an example of a dataset record containing hidden fields with an HTTP response and error.

{
    "url": "https://example.com",
    "title": "Example page",
    "data": {
        "foo": "bar"
    },
    "#error": null,
    "#response": {
        "statusCode": 201
    }
}

Data without hidden fields are called "clean" and can be downloaded from the Apify console using the "Clean items" link or via API using the clean=true or clean=1 URL parameters.
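To illustrate what removing hidden fields means, the sketch below strips them from a record held in memory. The removeHiddenFields() helper is hypothetical and only mirrors the rule stated above (top-level fields starting with #). It is not how the API implements the skipHidden or clean parameters.

```javascript
// Hypothetical helper: removes top-level fields whose names start with '#'.
function removeHiddenFields(item) {
    return Object.fromEntries(
        Object.entries(item).filter(([key]) => !key.startsWith('#')),
    );
}

const record = {
    url: 'https://example.com',
    title: 'Example page',
    data: { foo: 'bar' },
    '#error': null,
    '#response': { statusCode: 201 },
};

console.log(removeHiddenFields(record));
// { url: 'https://example.com', title: 'Example page', data: { foo: 'bar' } }
```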

XML format extension

When you export results to XML or RSS formats, object property names become XML tags, while the corresponding values become the tags' children.

For example, the JavaScript object:

{
    name: "Rashida Jones",
    address: [
        {
            type: "home",
            street: "21st",
            city: "Chicago",
        },
        {
            type: "office",
            street: null,
            city: null,
        }
    ]
}

becomes the following XML snippet:

<name>Rashida Jones</name>
<address>
    <type>home</type>
    <street>21st</street>
    <city>Chicago</city>
</address>
<address>
    <type>office</type>
    <street/>
    <city/>
</address>

If the JavaScript object contains a property named @, its sub-properties are exported as attributes of the parent XML element. If the parent XML element does not have any child elements, its value is taken from a JavaScript object property named #.

For example, the following JavaScript object:

{
    "address": [{
        "@": {
            "type": "home",
        },
        "street": "21st",
        "city": "Chicago",
    },
    {
        "@": {
            "type": "office",
        },
        "#": "unknown",
    }]
}

will be transformed to the following XML snippet:

<address type="home">
    <street>21st</street>
    <city>Chicago</city>
</address>
<address type="office">unknown</address>

This feature is also useful when customizing your RSS feeds generated for various websites.
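The conversion rules above can be approximated with a short recursive function. This is a simplified sketch, not the exporter Apify uses: the function names are hypothetical, the output is not indented, and the <items/> and <item/> wrappers are left out.

```javascript
// Escape characters that are special in XML text and attribute values
function escapeXml(value) {
    return String(value)
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;');
}

// Render the element `tag` for the given value
function elementToXml(tag, value) {
    // Arrays repeat the same tag once per entry
    if (Array.isArray(value)) {
        return value.map((entry) => elementToXml(tag, entry)).join('');
    }
    // null and undefined become an empty element
    if (value === null || value === undefined) return `<${tag}/>`;
    // Primitive values become the element's text
    if (typeof value !== 'object') return `<${tag}>${escapeXml(value)}</${tag}>`;

    // '@' holds the attributes, '#' holds the text value, the rest are children
    const { '@': attributes = {}, '#': text, ...children } = value;
    const attrs = Object.entries(attributes)
        .map(([name, attrValue]) => ` ${name}="${escapeXml(attrValue)}"`)
        .join('');
    const childXml = objectToXml(children);
    // The '#' value is used only when there are no child elements
    const body = childXml !== '' ? childXml : text == null ? '' : escapeXml(text);
    return body === '' ? `<${tag}${attrs}/>` : `<${tag}${attrs}>${body}</${tag}>`;
}

// Render every property of an object as an element
function objectToXml(object) {
    return Object.entries(object)
        .map(([tag, value]) => elementToXml(tag, value))
        .join('');
}

const xml = objectToXml({
    address: [
        { '@': { type: 'home' }, street: '21st', city: 'Chicago' },
        { '@': { type: 'office' }, '#': 'unknown' },
    ],
});
console.log(xml);
// <address type="home"><street>21st</street><city>Chicago</city></address><address type="office">unknown</address>
```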

By default, the whole result is wrapped in an <items/> element, while each item is contained in an <item/> element. You can change these element names using the xmlRoot and xmlRow URL parameters when retrieving your data.

Sharing

You can invite other Apify users to view or modify your datasets using the access rights system. See the full list of permissions.

Sharing datasets between runs

You can access a dataset from any actor or task run as long as you know its name or ID.

To access a dataset from another run using the Apify SDK, open it using the Apify.openDataset([datasetIdOrName]) method like you would any other dataset.

const otherDataset = await Apify.openDataset('old-dataset');

In the JavaScript API client, you can access a dataset using its client. Once you've opened the dataset, read its contents and add new data like you would with a dataset from your current run.

const otherDatasetClient = apifyClient.dataset('jane-doe/old-dataset');

Likewise, in the Python API client, you can access a dataset using its client.

other_dataset_client = apify_client.dataset('jane-doe/old-dataset')

The same applies for the Apify API - you can use the same endpoints as you would normally.

See the Storage overview for details on sharing storages between runs.

Limits

  • Tabulated data storage formats (ones that display the data in columns), such as HTML, CSV, and Excel, are limited to 3000 columns. Data that do not fit within this limit are not retrieved.

  • When using the pushData() method, the size of the data is limited by the receiving API. Therefore, pushData() will only allow objects whose JSON representation is smaller than 9 MB. When an array is passed, none of the included objects may be larger than 9 MB; however, the array itself may be of any size.

  • Dataset names can be up to 63 characters long.
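If you want to check item sizes before pushing, a rough pre-flight check could look like the sketch below. The findOversizedItems() helper is hypothetical, and it assumes that 9 MB means 9 * 1024 * 1024 bytes of UTF-8 encoded JSON. The exact threshold applied by the API may differ.

```javascript
// Assumption: 9 MB is interpreted as 9 * 1024 * 1024 bytes
const MAX_ITEM_BYTES = 9 * 1024 * 1024;

// Hypothetical helper: returns the items whose JSON representation
// is not smaller than the given limit.
function findOversizedItems(items, limit = MAX_ITEM_BYTES) {
    return items.filter(
        (item) => Buffer.byteLength(JSON.stringify(item), 'utf8') >= limit,
    );
}

// A tiny limit is used here only to keep the demonstration small
const sample = [{ text: 'short' }, { text: 'x'.repeat(100) }];
console.log(findOversizedItems(sample, 50).length); // 1
```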

Rate limiting

When pushing data to a dataset via API, the request rate is limited to 200 per second per dataset. This helps protect Apify servers from being overloaded.

All other dataset API endpoints are limited to 30 requests per second per dataset.

See the API documentation for details and to learn what to do if you exceed the rate limit.
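One way to reduce the number of push requests is to send several items per request, since the payload can be an array. The sketch below splits a list of items into batches. The toBatches() helper and the batch size of 500 are illustrative assumptions, not recommendations from the API documentation.

```javascript
// Hypothetical helper: splits items into arrays of at most batchSize items.
function toBatches(items, batchSize) {
    if (!Number.isInteger(batchSize) || batchSize < 1) {
        throw new RangeError('batchSize must be a positive integer');
    }
    const batches = [];
    for (let start = 0; start < items.length; start += batchSize) {
        batches.push(items.slice(start, start + batchSize));
    }
    return batches;
}

const allItems = Array.from({ length: 1050 }, (_, index) => ({ index }));
const batches = toBatches(allItems, 500);
console.log(batches.map((batch) => batch.length)); // [ 500, 500, 50 ]
```

Each batch can then be sent as the payload of a single request, which means three requests instead of 1050 in this example.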