Dataset Schema Specification

Learn how to define and present your dataset schema in a user-friendly output UI.

The dataset schema defines the structure and representation of data produced by an Actor, both in the API and the visual user interface.

Example

Let's consider an example Actor that calls Actor.pushData() to store data in the dataset:

main.js
import { Actor } from 'apify';
// Initialize the JavaScript SDK
await Actor.init();

/**
 * Actor code
 */
await Actor.pushData({
    numericField: 10,
    pictureUrl: 'https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_92x30dp.png',
    linkUrl: 'https://google.com',
    textField: 'Google',
    booleanField: true,
    dateField: new Date(),
    arrayField: ['#hello', '#world'],
    objectField: {},
});


// Exit successfully
await Actor.exit();

To set up the Actor's output tab UI using a single configuration file, use the following template for the .actor/actor.json configuration:

.actor/actor.json
{
    "actorSpecification": 1,
    "name": "Actor Name",
    "title": "Actor Title",
    "version": "1.0.0",
    "storages": {
        "dataset": {
            "actorSpecification": 1,
            "views": {
                "overview": {
                    "title": "Overview",
                    "transformation": {
                        "fields": [
                            "pictureUrl",
                            "linkUrl",
                            "textField",
                            "booleanField",
                            "arrayField",
                            "objectField",
                            "dateField",
                            "numericField"
                        ]
                    },
                    "display": {
                        "component": "table",
                        "properties": {
                            "pictureUrl": {
                                "label": "Image",
                                "format": "image"
                            },
                            "linkUrl": {
                                "label": "Link",
                                "format": "link"
                            },
                            "textField": {
                                "label": "Text",
                                "format": "text"
                            },
                            "booleanField": {
                                "label": "Boolean",
                                "format": "boolean"
                            },
                            "arrayField": {
                                "label": "Array",
                                "format": "array"
                            },
                            "objectField": {
                                "label": "Object",
                                "format": "object"
                            },
                            "dateField": {
                                "label": "Date",
                                "format": "date"
                            },
                            "numericField": {
                                "label": "Number",
                                "format": "number"
                            }
                        }
                    }
                }
            }
        }
    }
}

The template above defines the configuration for the default dataset output view. Under the views property, there is one view titled Overview. The view configuration consists of two main steps:

transformation - set up how to fetch the data.
display - set up how to visually present the fetched data.

The default behavior of the Output tab UI table is to display all fields from transformation.fields in the specified order. You can customize the display properties for specific formats or column labels if needed.
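As a rough illustration of that default behavior, the sketch below mimics how a list such as transformation.fields could select and order the columns of one dataset item. The selectFields helper is hypothetical and written only for this example; it is not the platform's implementation.

```javascript
// Hypothetical helper (not part of the Apify SDK or API): builds one table row
// by picking the listed fields from a dataset item, in the listed order.
function selectFields(item, fields) {
    const row = {};
    for (const field of fields) {
        // A field that the item does not contain ends up as undefined.
        row[field] = item[field];
    }
    return row;
}

const sampleItem = { numericField: 10, textField: 'Google', linkUrl: 'https://google.com' };
const sampleRow = selectFields(sampleItem, ['linkUrl', 'textField', 'booleanField']);

console.log(Object.keys(sampleRow)); // [ 'linkUrl', 'textField', 'booleanField' ]
console.log(sampleRow.booleanField); // undefined
```

Fields that are not listed (numericField here) are left out of the row, and the column order follows the list rather than the order in which the Actor pushed the properties.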

Output tab UI

Structure

Output configuration files need to be located in the .actor folder within the Actor's root directory.

You can organize the files within the .actor folder in one of two ways.

Single configuration file

.actor/actor.json
{
    "actorSpecification": 1,
    "name": "this-is-book-library-scraper",
    "title": "Book Library scraper",
    "version": "1.0.0",
    "storages": {
        "dataset": {
            "actorSpecification": 1,
            "fields": {},
            "views": {
                "overview": {
                    "title": "Overview",
                    "transformation": {},
                    "display": {}
                }
            }
        }
    }
}

Separate configuration files

.actor/actor.json
{
    "actorSpecification": 1,
    "name": "this-is-book-library-scraper",
    "title": "Book Library scraper",
    "version": "1.0.0",
    "storages": {
        "dataset": "./dataset_schema.json"
    }
}

.actor/dataset_schema.json
{
    "actorSpecification": 1,
    "fields": {},
    "views": {
        "overview": {
            "title": "Overview",
            "transformation": {},
            "display": {}
        }
    }
}

Both methods are valid, so choose the one that best suits your needs.
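The difference between the two layouts comes down to whether storages.dataset holds the schema object itself or a relative path to it. The sketch below shows one way a script could handle both cases. The resolveDatasetSchema helper and the injected readJson function are hypothetical names used only for illustration; in a real project, readJson would typically read and parse a file relative to the .actor folder.

```javascript
// Hypothetical helper: returns the dataset schema object whether
// "storages.dataset" is an inline object or a path to a separate file.
function resolveDatasetSchema(actorConfig, readJson) {
    const dataset = actorConfig.storages?.dataset;
    if (typeof dataset === 'string') {
        // Separate configuration file: the value is a relative path.
        return readJson(dataset);
    }
    // Single configuration file: the value is already the schema object.
    return dataset;
}

// In-memory stand-in for the file system, so the sketch runs anywhere.
const files = {
    './dataset_schema.json': { actorSpecification: 1, fields: {}, views: {} },
};
const readJson = (relativePath) => files[relativePath];

const fromSeparateFile = resolveDatasetSchema(
    { storages: { dataset: './dataset_schema.json' } },
    readJson,
);
const fromSingleFile = resolveDatasetSchema(
    { storages: { dataset: { actorSpecification: 1, fields: {}, views: {} } } },
    readJson,
);

console.log(JSON.stringify(fromSeparateFile) === JSON.stringify(fromSingleFile)); // true
```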

Handle nested structures

The most frequently used data formats present the data in a tabular format (Output tab table, Excel, CSV). If your Actor produces nested JSON structures, you need to transform the nested data into a flat tabular format. You can flatten the data in the following ways:

Use transformation.flatten to flatten the nested structure of specified fields. This transforms the nested object into a flat structure. For example, with flatten:["foo"], the object {"foo": {"bar": "hello"}} is turned into {"foo.bar": "hello"}. Once the structure is flattened, use the flattened property name in both transformation.fields and display.properties; otherwise, fields might not be fetched or configured properly in the UI visualization.
Use transformation.unwind to deconstruct the nested children into parent objects.
Change the output structure in an Actor from nested to flat before the results are saved in the dataset.
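The first two options can be sketched locally as follows. This is a minimal illustration that reproduces only the documented examples for plain nested objects. The flattenFields and unwindFields helpers are hypothetical, handle a single level of nesting, and do not cover arrays or other cases the platform may handle.

```javascript
// True only for plain objects such as {} or { bar: 'hello' }.
function isPlainObject(value) {
    return value !== null && typeof value === 'object'
        && Object.getPrototypeOf(value) === Object.prototype;
}

// Hypothetical sketch of flatten: {"foo": {"bar": "hello"}} -> {"foo.bar": "hello"}
function flattenFields(item, flatten) {
    const result = {};
    for (const [key, value] of Object.entries(item)) {
        if (flatten.includes(key) && isPlainObject(value)) {
            // Prefix every child key with the parent key and a dot.
            for (const [childKey, childValue] of Object.entries(value)) {
                result[`${key}.${childKey}`] = childValue;
            }
        } else {
            result[key] = value;
        }
    }
    return result;
}

// Hypothetical sketch of unwind: {"foo": {"bar": "hello"}} -> {"bar": "hello"}
function unwindFields(item, unwind) {
    const result = { ...item };
    for (const key of unwind) {
        const value = result[key];
        if (isPlainObject(value)) {
            // Replace the nested field with its children on the parent.
            delete result[key];
            Object.assign(result, value);
        }
    }
    return result;
}

const nested = { id: 1, foo: { bar: 'hello' } };
console.log(flattenFields(nested, ['foo'])); // { id: 1, 'foo.bar': 'hello' }
console.log(unwindFields(nested, ['foo'])); // { id: 1, bar: 'hello' }
```

With the flattened form, the column would be referenced as foo.bar in transformation.fields and display.properties; with the unwound form, it would be referenced as bar.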

Dataset schema structure definitions

The dataset schema structure defines the various components and properties that govern the organization and representation of the output data produced by an Actor. It specifies the structure of the data, the transformations to be applied, and the visual display configurations for the Output tab UI.

DatasetSchema object definition

Property	Type	Required	Description
`actorSpecification`	integer	true	Specifies the version of the dataset schema structure document. Currently only version 1 is available.
`fields`	JSONSchema compatible object	true	Schema of one dataset object. Use JSON Schema Draft 2020-12 or other compatible formats.
`views`	DatasetView object	true	An object describing the API and UI views.

DatasetView object definition

Property	Type	Required	Description
`title`	string	true	The title is visible in UI in the Output tab and in the API.
`description`	string	false	The description is only available in the API response.
`transformation`	ViewTransformation object	true	The definition of the data transformation applied when dataset data is loaded from the Dataset API.
`display`	ViewDisplay object	true	The definition of the Output tab UI visualization.

ViewTransformation object definition

Property	Type	Required	Description
`fields`	string[]	true	Selects fields to be presented in the output. The order of fields matches the order of columns in visualization UI. If a field value is missing, it will be presented as undefined in the UI.
`unwind`	string[]	false	Deconstructs nested children into the parent object. For example, with `unwind:["foo"]`, the object `{"foo": {"bar": "hello"}}` is transformed into `{"bar": "hello"}`.
`flatten`	string[]	false	Transforms a nested object into a flat structure. For example, with `flatten:["foo"]`, the object `{"foo": {"bar": "hello"}}` is transformed into `{"foo.bar": "hello"}`.
`omit`	string[]	false	Removes the specified fields from the output. Nested field names can be used as well.
`limit`	integer	false	The maximum number of results returned. Default is all results.
`desc`	boolean	false	By default, results are sorted in ascending order based on when they were written to the dataset. If `desc:true`, the newest writes to the dataset are returned first.
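
To make the `omit`, `limit`, and `desc` options more concrete, here is a small local sketch. The applyView helper is hypothetical. It assumes that the items array is already in write order, that `omit` is given as a list of top-level field names, and that reversal happens before the limit is applied; none of these details are confirmed by the table above.

```javascript
// Hypothetical sketch of omit, limit, and desc applied to items in write order.
function applyView(items, { omit = [], limit, desc = false } = {}) {
    // desc: true returns the newest writes first (assumed to happen before limit).
    const ordered = desc ? [...items].reverse() : [...items];
    // No limit means all results are returned.
    const limited = limit === undefined ? ordered : ordered.slice(0, limit);
    return limited.map((item) => {
        const copy = { ...item };
        // Only top-level field names are handled in this sketch.
        for (const field of omit) delete copy[field];
        return copy;
    });
}

const written = [
    { id: 1, debug: 'a' },
    { id: 2, debug: 'b' },
    { id: 3, debug: 'c' },
];

console.log(applyView(written, { omit: ['debug'], limit: 2, desc: true }));
// [ { id: 3 }, { id: 2 } ]
```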

ViewDisplay object definition

Property	Type	Required	Description
`component`	string	true	Only the `table` component is available.
`properties`	Object	false	An object whose keys match the `transformation.fields` and whose values are `ViewDisplayProperty` objects. If properties are not set, the table is rendered automatically with fields formatted as strings, arrays, or objects.

ViewDisplayProperty object definition

Property	Type	Required	Description
`label`	string	false	In the Table view, the label will be visible as the table column's header.
`format`	One of `text`, `number`, `date`, `link`, `boolean`, `image`, `array`, `object`	false	Describes how output data values are formatted to be rendered in the Output tab UI.
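
The definitions above can be turned into a quick consistency check for a single view before deploying an Actor. The checkView helper below is a hypothetical sketch that covers only the rules stated in these tables (required title, a fields list, the table component, and the allowed format values); it is not an official validator.

```javascript
// Format values listed in the ViewDisplayProperty definition.
const ALLOWED_FORMATS = ['text', 'number', 'date', 'link', 'boolean', 'image', 'array', 'object'];

// Hypothetical helper: returns a list of problems found in one DatasetView object.
function checkView(view) {
    const problems = [];
    if (typeof view.title !== 'string') {
        problems.push('title must be a string');
    }
    const fields = view.transformation?.fields;
    if (!Array.isArray(fields)) {
        problems.push('transformation.fields must be an array of strings');
    }
    if (view.display?.component !== 'table') {
        problems.push('display.component must be "table"');
    }
    for (const [name, property] of Object.entries(view.display?.properties ?? {})) {
        // Keys of display.properties are expected to match transformation.fields.
        if (Array.isArray(fields) && !fields.includes(name)) {
            problems.push(`display.properties.${name} is not listed in transformation.fields`);
        }
        if (property.format !== undefined && !ALLOWED_FORMATS.includes(property.format)) {
            problems.push(`display.properties.${name} has unknown format "${property.format}"`);
        }
    }
    return problems;
}

const overview = {
    title: 'Overview',
    transformation: { fields: ['linkUrl'] },
    display: {
        component: 'table',
        properties: {
            linkUrl: { label: 'Link', format: 'link' },
            pictureUrl: { label: 'Image', format: 'picture' },
        },
    },
};

console.log(checkView(overview));
// [
//   'display.properties.pictureUrl is not listed in transformation.fields',
//   'display.properties.pictureUrl has unknown format "picture"'
// ]
```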
