Dataset Schema Specification
Learn how to define and present your dataset schema in a user-friendly output UI.
The dataset schema defines the structure and representation of data produced by an Actor, both in the API and the visual user interface.
Example
Let's consider an example Actor that calls Actor.pushData() to store data in a dataset:
import { Actor } from 'apify';

// Initialize the JavaScript SDK
await Actor.init();

/**
 * Actor code
 */
await Actor.pushData({
    numericField: 10,
    pictureUrl: 'https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_92x30dp.png',
    linkUrl: 'https://google.com',
    textField: 'Google',
    booleanField: true,
    dateField: new Date(),
    arrayField: ['#hello', '#world'],
    objectField: {},
});

// Exit successfully
await Actor.exit();
To set up the Actor's output tab UI using a single configuration file, use the following template for the .actor/actor.json configuration:
{
    "actorSpecification": 1,
    "name": "Actor Name",
    "title": "Actor Title",
    "version": "1.0.0",
    "storages": {
        "dataset": {
            "actorSpecification": 1,
            "views": {
                "overview": {
                    "title": "Overview",
                    "transformation": {
                        "fields": [
                            "pictureUrl",
                            "linkUrl",
                            "textField",
                            "booleanField",
                            "arrayField",
                            "objectField",
                            "dateField",
                            "numericField"
                        ]
                    },
                    "display": {
                        "component": "table",
                        "properties": {
                            "pictureUrl": {
                                "label": "Image",
                                "format": "image"
                            },
                            "linkUrl": {
                                "label": "Link",
                                "format": "link"
                            },
                            "textField": {
                                "label": "Text",
                                "format": "text"
                            },
                            "booleanField": {
                                "label": "Boolean",
                                "format": "boolean"
                            },
                            "arrayField": {
                                "label": "Array",
                                "format": "array"
                            },
                            "objectField": {
                                "label": "Object",
                                "format": "object"
                            },
                            "dateField": {
                                "label": "Date",
                                "format": "date"
                            },
                            "numericField": {
                                "label": "Number",
                                "format": "number"
                            }
                        }
                    }
                }
            }
        }
    }
}
The template above defines the configuration for the default dataset output view. Under the views property, there is one view titled Overview. The view configuration consists of two main steps:
- transformation - set up how to fetch the data.
- display - set up how to visually present the fetched data.
The default behavior of the Output tab UI table is to display all fields from transformation.fields in the specified order. You can customize the display properties for specific formats or column labels if needed.
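To make the role of transformation.fields more concrete, here is a minimal sketch of the selection and ordering it describes. The helper projectFields is hypothetical and written only for illustration; it is not part of the Apify SDK, and the platform's actual implementation may differ.

```javascript
// Illustrative only: mimics how `transformation.fields` selects and orders columns.
// `projectFields` is a hypothetical helper, not an Apify SDK function.
function projectFields(item, fields) {
    const row = {};
    for (const field of fields) {
        // A field missing from the item stays undefined, so its column still exists.
        row[field] = item[field];
    }
    return row;
}

const item = { numericField: 10, textField: 'Google', linkUrl: 'https://google.com' };
const row = projectFields(item, ['linkUrl', 'textField', 'booleanField']);

// Column order follows `fields`, not the order in which the item was written.
console.log(Object.keys(row));
```

Note that numericField is dropped because it is not listed, while booleanField is kept as an empty column.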
Structure
Output configuration files need to be located in the .actor folder within the Actor's root directory.
There are two ways to organize files within the .actor folder.
Single configuration file
{
    "actorSpecification": 1,
    "name": "this-is-book-library-scraper",
    "title": "Book Library scraper",
    "version": "1.0.0",
    "storages": {
        "dataset": {
            "actorSpecification": 1,
            "fields": {},
            "views": {
                "overview": {
                    "title": "Overview",
                    "transformation": {},
                    "display": {}
                }
            }
        }
    }
}
Separate configuration files
The .actor/actor.json file references the dataset schema by path:
{
    "actorSpecification": 1,
    "name": "this-is-book-library-scraper",
    "title": "Book Library scraper",
    "version": "1.0.0",
    "storages": {
        "dataset": "./dataset_schema.json"
    }
}
The referenced .actor/dataset_schema.json file contains the schema itself:
{
    "actorSpecification": 1,
    "fields": {},
    "views": {
        "overview": {
            "title": "Overview",
            "transformation": {},
            "display": {}
        }
    }
}
Both of these methods are valid, so choose the one that best suits your needs.
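If you write tooling that reads these files, the two layouts can be treated uniformly. The sketch below assumes a hypothetical helper, resolveDatasetSchema, and an injected readJson function standing in for file access; neither is part of the Apify SDK or CLI.

```javascript
// Hypothetical helper: returns the dataset schema whether it is inlined in
// actor.json or referenced by a relative path. `readJson` is supplied by the caller.
function resolveDatasetSchema(actorConfig, readJson) {
    const dataset = actorConfig.storages ? actorConfig.storages.dataset : undefined;
    if (dataset === undefined) return null;
    // A string value is treated as a path relative to the .actor folder.
    return typeof dataset === 'string' ? readJson(dataset) : dataset;
}

// In-memory stand-in for the files in the .actor folder.
const files = {
    './dataset_schema.json': { actorSpecification: 1, fields: {}, views: {} },
};
const separateLayout = { actorSpecification: 1, storages: { dataset: './dataset_schema.json' } };
const singleLayout = { actorSpecification: 1, storages: { dataset: files['./dataset_schema.json'] } };

const fromSeparate = resolveDatasetSchema(separateLayout, (path) => files[path]);
const fromSingle = resolveDatasetSchema(singleLayout, (path) => files[path]);

// Both layouts resolve to the same schema object in this example.
console.log(fromSeparate === fromSingle);
```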
Handle nested structures
The most frequently used data formats present the data in a tabular format (Output tab table, Excel, CSV). If your Actor produces nested JSON structures, you need to transform the nested data into a flat tabular format. You can flatten the data in the following ways:
- Use transformation.flatten to flatten the nested structure of specified fields. This transforms the nested object into a flat structure. For example, with flatten:["foo"], the object {"foo": {"bar": "hello"}} is turned into {"foo.bar": "hello"}. Once the structure is flattened, it's necessary to use the flattened property name in both transformation.fields and display.properties; otherwise, fields might not be fetched or configured properly in the UI visualization.
- Use transformation.unwind to deconstruct the nested children into parent objects.
- Change the output structure in an Actor from nested to flat before the results are saved in the dataset.
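The following is a minimal local sketch of what flatten and unwind do to a single item, which can also serve as a starting point for the third option (flattening inside the Actor before calling Actor.pushData()). The helpers flattenFields and unwindFields are hypothetical and only approximate the documented examples; handling of arrays and other edge cases on the platform is not covered here.

```javascript
// Hypothetical helpers that reproduce the documented flatten/unwind examples locally.
const isPlainObject = (value) =>
    value !== null && typeof value === 'object' && Object.getPrototypeOf(value) === Object.prototype;

// flatten: {"foo": {"bar": "hello"}} becomes {"foo.bar": "hello"}
function flattenFields(item, fieldsToFlatten) {
    const result = {};
    const expand = (prefix, value) => {
        // Non-objects and empty objects are kept as they are under the current key.
        if (!isPlainObject(value) || Object.keys(value).length === 0) {
            result[prefix] = value;
            return;
        }
        for (const [key, child] of Object.entries(value)) {
            expand(`${prefix}.${key}`, child);
        }
    };
    for (const [key, value] of Object.entries(item)) {
        if (fieldsToFlatten.includes(key)) expand(key, value);
        else result[key] = value;
    }
    return result;
}

// unwind: {"foo": {"bar": "hello"}} becomes {"bar": "hello"}
// Only nested objects are handled in this sketch; arrays are left untouched.
function unwindFields(item, fieldsToUnwind) {
    const result = { ...item };
    for (const field of fieldsToUnwind) {
        const value = result[field];
        if (isPlainObject(value)) {
            delete result[field];
            Object.assign(result, value);
        }
    }
    return result;
}

console.log(flattenFields({ foo: { bar: 'hello' } }, ['foo']));
console.log(unwindFields({ foo: { bar: 'hello' } }, ['foo']));
```

With the flattened variant, the column would then be referenced as foo.bar in both transformation.fields and display.properties.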
Dataset schema structure definitions
The dataset schema structure defines the various components and properties that govern the organization and representation of the output data produced by an Actor. It specifies the structure of the data, the transformations to be applied, and the visual display configurations for the Output tab UI.
DatasetSchema object definition
Property | Type | Required | Description |
---|---|---|---|
actorSpecification | integer | true | Specifies the version of the dataset schema structure document. Currently only version 1 is available. |
fields | JSONSchema compatible object | true | Schema of one dataset object. Use JSON Schema Draft 2020-12 or other compatible formats. |
views | DatasetView object | true | An object describing the API and UI views. |
DatasetView object definition
Property | Type | Required | Description |
---|---|---|---|
title | string | true | The title is visible in the UI in the Output tab and in the API. |
description | string | false | The description is only available in the API response. |
transformation | ViewTransformation object | true | The definition of data transformation applied when dataset data is loaded from the Dataset API. |
display | ViewDisplay object | true | The definition of Output tab UI visualization. |
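As a quick sanity check before deploying, the Required columns of the two tables above can be turned into a small presence check. This is a sketch only: findMissingProperties is a hypothetical helper, it checks that properties exist and nothing more, and it is not a substitute for any validation performed by the platform.

```javascript
// Hypothetical helper: lists required properties (per the tables above) that are absent.
function findMissingProperties(schema) {
    const missing = [];
    // DatasetSchema: actorSpecification, fields and views are marked as required.
    for (const key of ['actorSpecification', 'fields', 'views']) {
        if (schema[key] === undefined) missing.push(key);
    }
    // DatasetView: title, transformation and display are marked as required.
    for (const [viewName, view] of Object.entries(schema.views || {})) {
        for (const key of ['title', 'transformation', 'display']) {
            if (view[key] === undefined) missing.push(`views.${viewName}.${key}`);
        }
    }
    return missing;
}

const draftSchema = {
    actorSpecification: 1,
    views: {
        overview: { title: 'Overview', transformation: {} },
    },
};

// Reports the top-level `fields` and the view's `display` as missing.
console.log(findMissingProperties(draftSchema));
```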
ViewTransformation object definition
Property | Type | Required | Description |
---|---|---|---|
fields | string[] | true | Selects fields to be presented in the output. The order of fields matches the order of columns in the visualization UI. If a field value is missing, it will be presented as undefined in the UI. |
unwind | string[] | false | Deconstructs nested children into the parent object. For example, with unwind:["foo"], the object {"foo": {"bar": "hello"}} is transformed into {"bar": "hello"}. |
flatten | string[] | false | Transforms a nested object into a flat structure. For example, with flatten:["foo"], the object {"foo": {"bar": "hello"}} is transformed into {"foo.bar": "hello"}. |
omit | string | false | Removes the specified fields from the output. Nested field names can be used as well. |
limit | integer | false | The maximum number of results returned. Default is all results. |
desc | boolean | false | By default, results are sorted in ascending order based on the write event into the dataset. If desc:true, the newest writes to the dataset will be returned first. |
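The effect of omit, limit, and desc can be pictured with a small local simulation. The function applyTransformation below is hypothetical; it assumes the input array is already in write order (oldest first), that desc is applied before limit, and that omit is passed as a list of top-level field names, so nested field names are not handled.

```javascript
// Hypothetical local simulation of `omit`, `limit` and `desc`; not Apify code.
function applyTransformation(items, { omit = [], limit, desc = false } = {}) {
    // Assumption: `items` is in write order, oldest first.
    let result = desc ? [...items].reverse() : [...items];
    // Without a limit, all results are returned.
    if (limit !== undefined) result = result.slice(0, limit);
    return result.map((item) => {
        const copy = { ...item };
        for (const field of omit) delete copy[field];
        return copy;
    });
}

const written = [
    { id: 1, debug: 'a' },
    { id: 2, debug: 'b' },
    { id: 3, debug: 'c' },
];

// Newest two items first, without the `debug` field.
console.log(applyTransformation(written, { omit: ['debug'], limit: 2, desc: true }));
```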
ViewDisplay object definition
Property | Type | Required | Description |
---|---|---|---|
component | string | true | Only the table component is available. |
properties | Object | false | An object with keys matching the transformation.fields and ViewDisplayProperty as values. If properties are not set, the table will be rendered automatically with fields formatted as strings, arrays or objects. |
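Because the keys of properties are expected to match transformation.fields, a mismatch (for example, after flattening a field) is easy to overlook. The sketch below uses a hypothetical helper, findUnknownDisplayKeys, to list display keys that have no corresponding entry in transformation.fields.

```javascript
// Hypothetical helper: returns keys of display.properties not listed in transformation.fields.
function findUnknownDisplayKeys(view) {
    const fields = new Set((view.transformation && view.transformation.fields) || []);
    const properties = (view.display && view.display.properties) || {};
    return Object.keys(properties).filter((key) => !fields.has(key));
}

const overview = {
    title: 'Overview',
    transformation: { fields: ['linkUrl', 'foo.bar'], flatten: ['foo'] },
    display: {
        component: 'table',
        properties: {
            linkUrl: { label: 'Link', format: 'link' },
            // Uses the nested name instead of the flattened `foo.bar`.
            foo: { label: 'Foo', format: 'text' },
        },
    },
};

console.log(findUnknownDisplayKeys(overview));
```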
ViewDisplayProperty object definition
Property | Type | Required | Description |
---|---|---|---|
label | string | false | In the Table view, the label will be visible as the table column's header. |
format | One of: text, number, date, link, boolean, image, array, object | false | Describes how output data values are formatted to be rendered in the Output tab UI. |