Dataset

Store and export web scraping, crawling or data processing job results. Learn how to access and manage datasets in the Apify app or via API.

Dataset storage enables you to sequentially save and retrieve data. Each actor run is assigned its own dataset, which is created when the first item is stored to it.

Datasets usually contain results from web scraping, crawling or data processing jobs. The data can be visualized as a table where each object is a row and its attributes are the columns. The data can be exported in JSON, CSV, XML, RSS, Excel or HTML formats.

Named datasets are retained indefinitely.
Unnamed datasets expire after 7 days unless otherwise specified.
Learn about named and unnamed datasets.

Dataset storage is append-only - data can only be added and cannot be changed or deleted.
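As a rough mental model only, the append-only behavior can be sketched with a small in-memory class. The AppendOnlyDataset class and its method names are hypothetical and are not part of any Apify library; the sketch simply shows that items are added in order and that stored items cannot be modified afterwards.

```javascript
// Hypothetical in-memory model of an append-only store (illustration only).
class AppendOnlyDataset {
    constructor() {
        this.items = [];
    }

    // Accepts one object or an array of objects and appends them in order.
    pushData(data) {
        const batch = Array.isArray(data) ? data : [data];
        for (const item of batch) {
            // Store a deep copy so later changes to the caller's object
            // cannot alter what has already been stored.
            this.items.push(JSON.parse(JSON.stringify(item)));
        }
    }

    // Returns copies of the stored items in insertion order.
    getData({ offset = 0, limit = Infinity } = {}) {
        return this.items
            .slice(offset, offset + limit)
            .map((item) => JSON.parse(JSON.stringify(item)));
    }
}
```

There are deliberately no update or delete methods: the only way to change the contents is to append more items.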

Basic usage

There are four ways to access your datasets: the Apify app, the Apify SDK, the JavaScript API client, and the Apify API.

Apify app

In the Apify app, you can view your datasets in the Storage section under the Datasets tab.

Only named datasets are displayed by default. Select the Include unnamed datasets checkbox to display all of your datasets.

Datasets in app

To view or download a dataset in any of the formats listed above, click its Dataset ID. On the detail page, you can update the dataset's name (and, in turn, its retention period) and access rights under the Settings tab. The API tab allows you to view and test the dataset's API endpoints.

Datasets detail view

Apify SDK

If you are building an Apify actor, you will be using the Apify SDK. In the Apify SDK, the dataset is represented by the Dataset class.

You can use the Dataset class to specify whether your data is stored locally or in the Apify cloud, push data to datasets of your choice using the pushData() method, and perform functions such as getData(), map() and reduce() (see example).

If you have chosen to store your dataset locally, you can find it in the location below.

{APIFY_LOCAL_STORAGE_DIR}/datasets/{DATASET_ID}/{INDEX}.json

DATASET_ID refers to the dataset's name or ID. The default dataset will be stored in the default directory.
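If you want to inspect a locally stored dataset outside of the SDK, the files can be read directly. The sketch below is a minimal example and not part of the Apify SDK: readLocalDataset is a hypothetical helper, and it assumes that the {INDEX} file names sort in insertion order.

```javascript
const fs = require("fs");
const path = require("path");

// Reads every {INDEX}.json file of a locally stored dataset.
// storageDir stands for APIFY_LOCAL_STORAGE_DIR; datasetId is the dataset's
// name or ID ("default" for the default dataset).
function readLocalDataset(storageDir, datasetId = "default") {
    const datasetDir = path.join(storageDir, "datasets", datasetId);
    return fs
        .readdirSync(datasetDir)
        .filter((fileName) => fileName.endsWith(".json"))
        // Assumption: file names sort in insertion order.
        .sort()
        .map((fileName) =>
            JSON.parse(fs.readFileSync(path.join(datasetDir, fileName), "utf8"))
        );
}
```

For example, readLocalDataset(process.env.APIFY_LOCAL_STORAGE_DIR) would return the items of the default dataset as an array of objects.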

To add data to the default dataset, you can use the example below. Using the Apify.main() function is optional; it is only provided for your convenience.

// Import the Apify SDK into your project
const Apify = require("apify");

// The optional Apify.main() function performs the
// actor's job and terminates the process when it is finished
Apify.main(async () => {

    // Add one item to the default dataset
    await Apify.pushData({ foo: "bar" });

    // Add multiple items to the default dataset
    await Apify.pushData([{ foo: "hotel" }, { foo: "cafe" }]);
});

Make sure to use the await keyword when calling pushData(), otherwise the actor process might finish before the data are stored.

If you want to use something other than the default dataset, e.g. a dataset that you share between actors or between actor runs, you can use the Apify.openDataset() method.

// Save a named dataset to a variable
const dataset = await Apify.openDataset("some-name");

// Add data to the named dataset
await dataset.pushData({ foo: "bar" });

When using the getData() method, you can limit the data you retrieve using the [fields] parameter. It should be an array of field names (strings) that will be included in the results. To include all fields, simply omit the [fields] parameter.

// Only get the "hotel" and "cafe" fields
const hotelAndCafeData = await dataset.getData({
    fields: ["hotel", "cafe"]
});

For more information on managing datasets using the Apify SDK, see the SDK documentation and the Dataset class's API reference.

JavaScript API client

Apify's JavaScript API client (apify-client) allows you to access your datasets from any Node.js application, whether it is running on the Apify platform or elsewhere.

For help with setting up the client, see the JavaScript API client section on the overview page.

After importing the apify-client package into your application and creating an instance of it, save the instance's datasets property to a variable for easier access.

// Save your datasets to a variable for easier access
const datasets = apifyClient.datasets;

You can then create, update, and delete datasets using the commands below.

// Get the dataset with the name "my-dataset"
// or create it if it doesn't exist
const dataset = await datasets.getOrCreateDataset({
    datasetName: "my-dataset",
});

// Set the dataset as the default to be used
// in the following commands
apifyClient.setOptions({ datasetId: dataset.id });

// Add an object and an array of objects to the dataset
await datasets.putItems({
    data: { foo: "bar" }
});
await datasets.putItems({
    data: [{ foo: "hotel" }, { foo: "cafe" }]
});

// Get items from a dataset
const paginationList = await datasets.getItems();
const items = paginationList.items;

// Delete a dataset
await datasets.deleteDataset();

When using the getItems() method, you can limit the data you retrieve using the [fields] parameter. It should be an array of field names (strings) that will be included in the results. To include all fields, simply omit the [fields] parameter.

// Only get the "hotel" and "cafe" fields
const hotelAndCafeData = await datasets.getItems({
    fields: ["hotel", "cafe"]
});

If you both specify and omit the same field in a request, the omit parameter will prevail and the field will not be returned.

For more information, see the JavaScript API client documentation.

Apify API

The Apify API allows you to access your datasets programmatically using HTTP requests and easily share your crawling results.

If you are accessing your datasets using the username~store-name store ID format, you will need to append your secret API token as a query parameter (see below). You can find the token (and your user ID) on the Integrations page of your Apify account.

To get a list of your datasets, send a GET request to the Get list of datasets endpoint, providing your API token as a query parameter.

https://api.apify.com/v2/datasets?token={YOUR_API_TOKEN}

To get information about a dataset such as its creation time and item count, send a GET request to the Get dataset endpoint.

https://api.apify.com/v2/datasets/{DATASET_ID}?token={YOUR_API_TOKEN}

To view a dataset's data, send a GET request to the Get dataset items Apify API endpoint.

https://api.apify.com/v2/datasets/{DATASET_ID}/items/?token={YOUR_API_TOKEN}

You can specify which data are exported by adding a comma-separated list of fields to the fields query parameter. Likewise, you can also omit certain fields using the omit parameter.

If you both specify and omit the same field in a request, the omit parameter will prevail and the field will not be returned.
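The precedence rule can be illustrated locally. The sketch below only mimics the behavior described above on a plain object; the actual filtering is done by the API, and selectFields is a hypothetical helper name.

```javascript
// Local illustration of how `fields` and `omit` interact.
function selectFields(item, { fields, omit = [] } = {}) {
    // Without `fields`, start from all keys of the item.
    const wanted = fields ?? Object.keys(item);
    const result = {};
    for (const key of wanted) {
        const present = Object.prototype.hasOwnProperty.call(item, key);
        // `omit` prevails over `fields` when a key appears in both.
        if (present && !omit.includes(key)) {
            result[key] = item[key];
        }
    }
    return result;
}

const record = { hotel: "Grand", cafe: "Corner", foo: "bar" };
const selected = selectFields(record, { fields: ["hotel", "cafe"], omit: ["cafe"] });
// `selected` contains only the "hotel" field.
```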

To retrieve the hotel and cafe fields, you would send your GET request to the URL below.

https://api.apify.com/v2/datasets/{DATASET_ID}/items?token={YOUR_API_TOKEN}&fields=hotel%2Ccafe

Instead of commas, you will need to use the %2C code, which represents a comma (,) in URL encoding.
Learn more about URL encoding here.
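In Node.js, you do not have to encode the commas by hand. The following sketch builds the URL with the built-in URL class, which percent-encodes query values for you. The buildItemsUrl helper is hypothetical and the dataset ID and token are placeholders.

```javascript
// Builds a "Get dataset items" URL. The URL class percent-encodes query
// values, so the comma between field names becomes %2C automatically.
function buildItemsUrl(datasetId, { token, fields = [], omit = [] } = {}) {
    const url = new URL(
        `https://api.apify.com/v2/datasets/${encodeURIComponent(datasetId)}/items`
    );
    if (token) url.searchParams.set("token", token);
    if (fields.length > 0) url.searchParams.set("fields", fields.join(","));
    if (omit.length > 0) url.searchParams.set("omit", omit.join(","));
    return url.toString();
}

const itemsUrl = buildItemsUrl("DATASET_ID", {
    token: "YOUR_API_TOKEN",
    fields: ["hotel", "cafe"],
});
// https://api.apify.com/v2/datasets/DATASET_ID/items?token=YOUR_API_TOKEN&fields=hotel%2Ccafe
```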

To add data to a dataset, send a POST request to the Put items endpoint with a JSON payload containing the data you want to add.

https://api.apify.com/v2/datasets/{DATASET_ID}/items/?token={YOUR_API_TOKEN}

Pushing data to a dataset via the API is limited to 200 requests per second to prevent our servers from being overloaded.

Example payload:

[
    {
        "foo": "bar"
    },
    {
        "foo": "hotel"
    },
    {
        "foo": "cafe"
    }
]
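As a minimal sketch, the request for the payload above could be prepared as follows. Nothing is sent here: buildPushRequest is a hypothetical helper that only assembles the URL and fetch()-style options, and the Content-Type header is an assumption based on the payload being JSON.

```javascript
// Builds the URL and fetch()-style options for adding items to a dataset.
// Nothing is sent; pass the result to an HTTP client of your choice.
function buildPushRequest(datasetId, token, items) {
    const url = new URL(
        `https://api.apify.com/v2/datasets/${encodeURIComponent(datasetId)}/items`
    );
    url.searchParams.set("token", token);
    return {
        url: url.toString(),
        options: {
            method: "POST",
            // Assumption: the payload is declared as JSON.
            headers: { "Content-Type": "application/json" },
            body: JSON.stringify(items),
        },
    };
}

const request = buildPushRequest("DATASET_ID", "YOUR_API_TOKEN", [
    { foo: "bar" },
    { foo: "hotel" },
    { foo: "cafe" },
]);
```

With Node.js 18 or later, the prepared request could then be sent using the built-in fetch function: await fetch(request.url, request.options).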

For a detailed breakdown of each API endpoint, see the API documentation.

Hidden fields

Top-level fields starting with the # character are considered hidden. These fields may be easily omitted when downloading the data from a dataset by providing the skipHidden=1 or clean=1 query parameters. This provides a convenient way to store debug information that should not appear in the final dataset.

Below is an example of a dataset record containing hidden fields with an HTTP response and error.

{
    "url": "https://example.com",
    "title": "Example page",
    "data": {
        "foo": "bar"
    },
    "#error": null,
    "#response": {
        "statusCode": 201
    }
}

Data without hidden fields are called "clean" and can be downloaded from the Apify app using the "Clean items" link or via API using the clean=true or clean=1 URL parameters.
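To make the definition concrete, the sketch below removes hidden fields from a record locally. It only mirrors the rule described above (top-level keys starting with #) and is not the API's implementation; stripHiddenFields is a hypothetical helper name.

```javascript
// Returns a copy of the record without top-level fields starting with "#".
function stripHiddenFields(item) {
    return Object.fromEntries(
        Object.entries(item).filter(([key]) => !key.startsWith("#"))
    );
}

const rawRecord = {
    url: "https://example.com",
    title: "Example page",
    data: { foo: "bar" },
    "#error": null,
    "#response": { statusCode: 201 },
};

const cleanRecord = stripHiddenFields(rawRecord);
// cleanRecord keeps only the "url", "title" and "data" fields.
```

Only top-level keys are checked, so a nested key that starts with # would be kept.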

XML format extension

When you export results to XML or RSS formats, object property names become XML tags, while the corresponding values become the tags' children.

For example, the JavaScript object:

{
    name: "Rashida Jones",
    address: [
        {
            type: "home",
            street: "21st",
            city: "Chicago",
        },
        {
            type: "office",
            street: null,
            city: null,
        }
    ]
}

becomes the following XML snippet:

<name>Rashida Jones</name>
<address>
    <type>home</type>
    <street>21st</street>
    <city>Chicago</city>
</address>
<address>
    <type>office</type>
    <street/>
    <city/>
</address>

If the JavaScript object contains a property named @, its sub-properties are exported as attributes of the parent XML element. If the parent XML element does not have any child elements, its value is taken from a JavaScript object property named #.

For example, the following JavaScript object:

{
    "address": [{
        "@": {
            "type": "home",
        },
        "street": "21st",
        "city": "Chicago",
    },
    {
        "@": {
            "type": "office",
        },
        "#": "unknown",
    }]
}

will be transformed to the following XML snippet:

<address type="home">
    <street>21st</street>
    <city>Chicago</city>
</address>
<address type="office">unknown</address>

This feature is also useful when customizing your RSS feeds generated for various websites.
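The conversion rules above can be approximated with a short recursive function. This is a sketch for illustration, not the converter used by the platform: objectToXml, elementToXml and escapeXml are hypothetical helpers, and details such as indentation and escaping are assumptions.

```javascript
function escapeXml(value) {
    return String(value)
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;");
}

function elementToXml(tag, value, indent) {
    // Arrays repeat the same tag once per element.
    if (Array.isArray(value)) {
        return value.map((entry) => elementToXml(tag, entry, indent)).join("\n");
    }
    // null (or undefined) becomes an empty element.
    if (value === null || value === undefined) {
        return `${indent}<${tag}/>`;
    }
    if (typeof value !== "object") {
        return `${indent}<${tag}>${escapeXml(value)}</${tag}>`;
    }
    // "@" holds attributes, "#" holds the text value, the rest are children.
    const { "@": attributes = {}, "#": text, ...children } = value;
    const attrs = Object.entries(attributes)
        .map(([name, attrValue]) => ` ${name}="${escapeXml(attrValue)}"`)
        .join("");
    if (Object.keys(children).length === 0) {
        return text === undefined || text === null
            ? `${indent}<${tag}${attrs}/>`
            : `${indent}<${tag}${attrs}>${escapeXml(text)}</${tag}>`;
    }
    const inner = objectToXml(children, `${indent}    `);
    return `${indent}<${tag}${attrs}>\n${inner}\n${indent}</${tag}>`;
}

function objectToXml(object, indent = "") {
    return Object.entries(object)
        .map(([tag, value]) => elementToXml(tag, value, indent))
        .filter((xml) => xml !== "")
        .join("\n");
}
```

Calling objectToXml() with either of the JavaScript objects shown above produces the corresponding XML snippet.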

By default, the whole result is wrapped in an <items/> element, while each page object is contained in an <item/> element. You can change this using the xmlRoot and xmlRow URL parameters when GETting your data.
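The effect of the two parameters can be pictured with a small local sketch. It only mimics the wrapping described above and does not call the API; wrapXmlItems is a hypothetical helper, and each snippet is assumed to be already valid XML for a single item.

```javascript
// Wraps per-item XML snippets in a row element and the whole list in a
// root element, using the default names unless others are given.
function wrapXmlItems(itemSnippets, { xmlRoot = "items", xmlRow = "item" } = {}) {
    const rows = itemSnippets.map(
        (snippet) => `<${xmlRow}>${snippet}</${xmlRow}>`
    );
    return `<${xmlRoot}>${rows.join("")}</${xmlRoot}>`;
}

const defaultWrapping = wrapXmlItems(["<foo>bar</foo>", "<foo>hotel</foo>"]);
const customWrapping = wrapXmlItems(["<foo>bar</foo>"], {
    xmlRoot: "results",
    xmlRow: "page",
});
```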

Sharing

You can invite other Apify users to view or modify your datasets using the access rights system. See the full list of permissions here.

Sharing datasets between runs

You can access a dataset from any actor or task run as long as you know its name or ID.

To access a dataset from another run using the Apify SDK, open it using the Apify.openDataset([datasetIdOrName]) method like you would any other dataset.

const otherDataset = await Apify.openDataset("old-dataset");

To access a dataset using the JavaScript API client, use the getOrCreateDataset() method.

const otherDataset = await datasets.getOrCreateDataset({
    datasetName: "my-dataset",
});

Once you've opened the dataset, read its contents and add new data like you would with a dataset from your current run.

The same applies for the Apify API - you can use the same endpoints as you would normally.

For more information on sharing storages between runs, see the Storage overview page.

Limits

  • Tabulated data storage formats (ones that display the data in columns), such as HTML, CSV, and Excel, are limited to 3000 columns. Data that does not fit into this limit will not be retrieved.

  • When using the pushData() method, the size of the data is limited by the receiving API. Therefore, pushData() will only allow objects whose JSON representation is smaller than 9 MB. When an array is passed, none of the included objects may be larger than 9 MB; the array itself, however, may be of any size.

  • Dataset names can be up to 63 characters long.
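To catch oversized items before pushing them, you can measure their JSON representation yourself. The sketch below is a rough pre-check under one stated assumption: the limit is interpreted as 9 * 1024 * 1024 bytes, and the exact threshold enforced by the API may differ. findOversizedItems is a hypothetical helper.

```javascript
// Assumption: the 9 MB limit is read as 9 * 1024 * 1024 bytes.
const MAX_ITEM_BYTES = 9 * 1024 * 1024;

// Returns the index and size of every item whose JSON form is too large.
// Accepts a single object or an array of objects.
function findOversizedItems(data) {
    const items = Array.isArray(data) ? data : [data];
    const oversized = [];
    items.forEach((item, index) => {
        const bytes = Buffer.byteLength(JSON.stringify(item), "utf8");
        if (bytes >= MAX_ITEM_BYTES) {
            oversized.push({ index, bytes });
        }
    });
    return oversized;
}
```

Each item is measured separately, which matches the rule that the array itself may be of any size.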

Rate limiting

When pushing data to a dataset via API, the request rate is limited to 200 per second per dataset. This helps protect Apify servers from being overloaded.

All other dataset API endpoints are limited to 30 requests per second per dataset.

See the API documentation for more details and to learn what to do if you exceed the rate limit.
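One simple way to stay well below the push limit is to send items in batches, since an array of objects can be added in a single request. The sketch below only splits the items; chunkItems is a hypothetical helper and the batch size is up to you.

```javascript
// Splits a list of items into batches of at most `batchSize` items, so that
// each batch can be pushed with one request instead of one request per item.
function chunkItems(items, batchSize) {
    if (!Number.isInteger(batchSize) || batchSize < 1) {
        throw new RangeError("batchSize must be a positive integer");
    }
    const batches = [];
    for (let start = 0; start < items.length; start += batchSize) {
        batches.push(items.slice(start, start + batchSize));
    }
    return batches;
}

const batches = chunkItems(
    [{ id: 1 }, { id: 2 }, { id: 3 }, { id: 4 }, { id: 5 }],
    2
);
// Three batches containing 2, 2 and 1 items.
```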