
Dataset Schema Specification

Learn how to define and present your dataset schema in a user-friendly output UI.


The dataset schema defines the structure and representation of data produced by an Actor, both in the API and the visual user interface.

Example

Let's consider an example Actor that calls Actor.pushData() to store data in the dataset:

main.js
import { Actor } from 'apify';

// Initialize the JavaScript SDK
await Actor.init();

/**
 * Actor code
 */
await Actor.pushData({
    numericField: 10,
    pictureUrl: 'https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_92x30dp.png',
    linkUrl: 'https://google.com',
    textField: 'Google',
    booleanField: true,
    dateField: new Date(),
    arrayField: ['#hello', '#world'],
    objectField: {},
});

// Exit successfully
await Actor.exit();

To set up the Actor's output tab UI using a single configuration file, use the following template for the .actor/actor.json configuration:

.actor/actor.json
{
    "actorSpecification": 1,
    "name": "Actor Name",
    "title": "Actor Title",
    "version": "1.0.0",
    "storages": {
        "dataset": {
            "actorSpecification": 1,
            "views": {
                "overview": {
                    "title": "Overview",
                    "transformation": {
                        "fields": [
                            "pictureUrl",
                            "linkUrl",
                            "textField",
                            "booleanField",
                            "arrayField",
                            "objectField",
                            "dateField",
                            "numericField"
                        ]
                    },
                    "display": {
                        "component": "table",
                        "properties": {
                            "pictureUrl": {
                                "label": "Image",
                                "format": "image"
                            },
                            "linkUrl": {
                                "label": "Link",
                                "format": "link"
                            },
                            "textField": {
                                "label": "Text",
                                "format": "text"
                            },
                            "booleanField": {
                                "label": "Boolean",
                                "format": "boolean"
                            },
                            "arrayField": {
                                "label": "Array",
                                "format": "array"
                            },
                            "objectField": {
                                "label": "Object",
                                "format": "object"
                            },
                            "dateField": {
                                "label": "Date",
                                "format": "date"
                            },
                            "numericField": {
                                "label": "Number",
                                "format": "number"
                            }
                        }
                    }
                }
            }
        }
    }
}

The template above defines the configuration for the default dataset output view. Under the views property, there is one view titled Overview. The view configuration consists of two main steps:

  1. transformation - set up how to fetch the data.
  2. display - set up how to visually present the fetched data.

The default behavior of the Output tab UI table is to display all fields from transformation.fields in the specified order. You can customize the display properties for specific formats or column labels if needed.
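The relationship between transformation.fields and display.properties can be sketched in plain JavaScript. This is only an illustration of the behavior described above, not the platform's implementation: buildColumns is a hypothetical helper, and falling back to the field name when no label is configured is an assumption.

```javascript
// Hypothetical helper: derives table columns from one view of a dataset schema.
// Column order follows transformation.fields; display.properties only adds
// optional labels and formats.
function buildColumns(view) {
    const fields = view.transformation.fields;
    const properties = (view.display && view.display.properties) || {};
    return fields.map((field) => {
        const property = properties[field] || {};
        return {
            field,
            // Assumption: use the field name when no label is configured.
            label: property.label ?? field,
            // undefined means no explicit format was configured for this column.
            format: property.format,
        };
    });
}

const overview = {
    title: 'Overview',
    transformation: { fields: ['pictureUrl', 'linkUrl', 'numericField'] },
    display: {
        component: 'table',
        properties: {
            pictureUrl: { label: 'Image', format: 'image' },
            linkUrl: { label: 'Link', format: 'link' },
        },
    },
};

const columns = buildColumns(overview);
console.log(columns.map((column) => column.label)); // [ 'Image', 'Link', 'numericField' ]
```

Here numericField has no entry in display.properties, so it still appears as the third column but without a custom label or format.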

Output tab UI

Structure

Output configuration files need to be located in the .actor folder within the Actor's root directory.

You can organize files within the .actor folder in one of two ways.

Single configuration file

.actor/actor.json
{
    "actorSpecification": 1,
    "name": "this-is-book-library-scraper",
    "title": "Book Library scraper",
    "version": "1.0.0",
    "storages": {
        "dataset": {
            "actorSpecification": 1,
            "fields": {},
            "views": {
                "overview": {
                    "title": "Overview",
                    "transformation": {},
                    "display": {}
                }
            }
        }
    }
}

Separate configuration files

.actor/actor.json
{
    "actorSpecification": 1,
    "name": "this-is-book-library-scraper",
    "title": "Book Library scraper",
    "version": "1.0.0",
    "storages": {
        "dataset": "./dataset_schema.json"
    }
}

.actor/dataset_schema.json
{
    "actorSpecification": 1,
    "fields": {},
    "views": {
        "overview": {
            "title": "Overview",
            "transformation": {},
            "display": {}
        }
    }
}

Both methods are valid, so choose the one that suits your needs best.
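To illustrate that the two layouts describe the same schema, here is a minimal sketch that resolves storages.dataset whether it is an inline object or a relative file path. resolveDatasetSchema and the injected loadJson function are hypothetical names used for illustration only; they are not part of the Apify SDK or CLI.

```javascript
// Hypothetical helper: returns the dataset schema object for either layout.
// `loadJson` is an injected function that reads a JSON file given a path
// relative to the .actor folder (assumption made for this sketch).
function resolveDatasetSchema(actorConfig, loadJson) {
    const dataset = actorConfig.storages ? actorConfig.storages.dataset : undefined;
    // A string is treated as a path to a separate schema file.
    if (typeof dataset === 'string') return loadJson(dataset);
    // Otherwise the schema is inline (or missing entirely).
    return dataset;
}

const schema = { actorSpecification: 1, fields: {}, views: {} };
const singleFile = { actorSpecification: 1, storages: { dataset: schema } };
const separateFiles = { actorSpecification: 1, storages: { dataset: './dataset_schema.json' } };

// Stand-in for reading .actor/dataset_schema.json from disk.
const fakeLoader = (relativePath) => {
    if (relativePath !== './dataset_schema.json') throw new Error(`Unexpected path: ${relativePath}`);
    return schema;
};

console.log(
    resolveDatasetSchema(singleFile, fakeLoader) === resolveDatasetSchema(separateFiles, fakeLoader),
); // true
```

In a real project, the loader would read and parse the referenced file from the .actor folder instead of returning an in-memory object.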

Handle nested structures

The most frequently used data formats present the data in a tabular format (Output tab table, Excel, CSV). If your Actor produces nested JSON structures, you need to transform the nested data into a flat tabular format. You can flatten the data in the following ways:

  • Use transformation.flatten to flatten the nested structure of specified fields. This transforms the nested object into a flat structure. For example, with flatten:["foo"], the object {"foo": {"bar": "hello"}} is turned into {"foo.bar": "hello"}. Once the structure is flattened, you must use the flattened property name (here, foo.bar) in both transformation.fields and display.properties. Otherwise, fields might not be fetched or configured properly in the UI visualization.

  • Use transformation.unwind to deconstruct the nested children into parent objects.

  • Change the output structure in an Actor from nested to flat before the results are saved in the dataset.
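The difference between flatten and unwind is easier to see side by side. The following is a minimal local sketch that reproduces only the object case described above. flattenFields and unwindFields are hypothetical helpers, not the platform's code, and their handling of arrays, empty objects, and key collisions is an assumption.

```javascript
const isPlainObject = (value) => value !== null && typeof value === 'object' && !Array.isArray(value);

// Flattens the listed top-level fields into dot-separated keys.
function flattenFields(item, fieldsToFlatten) {
    const result = {};
    const visit = (prefix, value) => {
        if (isPlainObject(value) && Object.keys(value).length > 0) {
            for (const [key, child] of Object.entries(value)) visit(`${prefix}.${key}`, child);
        } else {
            // Assumption: arrays, primitives, and empty objects are kept as-is.
            result[prefix] = value;
        }
    };
    for (const [key, value] of Object.entries(item)) {
        if (fieldsToFlatten.includes(key)) visit(key, value);
        else result[key] = value;
    }
    return result;
}

// Merges the children of the listed fields into the parent object.
function unwindFields(item, fieldsToUnwind) {
    const result = {};
    for (const [key, value] of Object.entries(item)) {
        // Assumption: on a key collision, the value written later wins.
        if (fieldsToUnwind.includes(key) && isPlainObject(value)) Object.assign(result, value);
        else result[key] = value;
    }
    return result;
}

const item = { foo: { bar: 'hello' }, id: 1 };
console.log(JSON.stringify(flattenFields(item, ['foo']))); // {"foo.bar":"hello","id":1}
console.log(JSON.stringify(unwindFields(item, ['foo']))); // {"bar":"hello","id":1}
```

The same two functions could also be applied inside the Actor before saving results, which corresponds to the third option in the list above.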

Dataset schema structure definitions

The dataset schema structure defines the various components and properties that govern the organization and representation of the output data produced by an Actor. It specifies the structure of the data, the transformations to be applied, and the visual display configurations for the Output tab UI.

DatasetSchema object definition

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| actorSpecification | integer | true | Specifies the version of the dataset schema structure document. Currently, only version 1 is available. |
| fields | JSON Schema compatible object | true | Schema of one dataset object. Use JSON Schema Draft 2020-12 or other compatible formats. |
| views | DatasetView object | true | An object describing the API and UI views. |

DatasetView object definition

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| title | string | true | The title is visible in the Output tab UI and in the API. |
| description | string | false | The description is only available in the API response. |
| transformation | ViewTransformation object | true | The definition of the data transformation applied when dataset data is loaded from the Dataset API. |
| display | ViewDisplay object | true | The definition of the Output tab UI visualization. |

ViewTransformation object definition

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| fields | string[] | true | Selects fields to be presented in the output. The order of fields matches the order of columns in the visualization UI. If a field value is missing, it is presented as undefined in the UI. |
| unwind | string[] | false | Deconstructs nested children into the parent object. For example, with unwind:["foo"], the object {"foo": {"bar": "hello"}} is transformed into {"bar": "hello"}. |
| flatten | string[] | false | Transforms a nested object into a flat structure. For example, with flatten:["foo"], the object {"foo": {"bar": "hello"}} is transformed into {"foo.bar": "hello"}. |
| omit | string[] | false | Removes the specified fields from the output. Nested field names can be used as well. |
| limit | integer | false | The maximum number of results returned. By default, all results are returned. |
| desc | boolean | false | By default, results are sorted in ascending order based on the write event into the dataset. If desc:true, the newest writes to the dataset are returned first. |
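To make the effect of fields, omit, limit, and desc concrete, here is a minimal local emulation on an in-memory array. applyView is a hypothetical function written for this illustration. The real transformation runs in the Dataset API, and the order in which this sketch applies the options (sorting first, then limit) is an assumption.

```javascript
// Hypothetical emulation of a view transformation on items stored in write order.
function applyView(items, { fields, omit = [], limit, desc = false }) {
    // desc:true returns the newest writes first.
    const ordered = desc ? [...items].reverse() : [...items];
    // limit caps the number of returned results; undefined means all results.
    const limited = limit === undefined ? ordered : ordered.slice(0, limit);
    return limited.map((item) => {
        const row = {};
        for (const field of fields) {
            if (omit.includes(field)) continue;
            // A missing value stays undefined, as described for the fields property.
            row[field] = item[field];
        }
        return row;
    });
}

// Items listed in the order they were written to the dataset.
const items = [
    { id: 1, title: 'First', internal: 'x' },
    { id: 2, title: 'Second', internal: 'y' },
    { id: 3, internal: 'z' },
];

const rows = applyView(items, { fields: ['id', 'title'], limit: 2, desc: true });
console.log(rows); // [ { id: 3, title: undefined }, { id: 2, title: 'Second' } ]
```

The third item has no title, so its row carries an undefined value rather than dropping the column, which keeps every row aligned with the configured field order.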

ViewDisplay object definition

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| component | string | true | Only the table component is available. |
| properties | Object | false | An object with keys matching the transformation.fields and ViewDisplayProperty as values. If properties are not set, the table will be rendered automatically with fields formatted as strings, arrays or objects. |

ViewDisplayProperty object definition

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| label | string | false | In the Table view, the label will be visible as the table column's header. |
| format | One of: text, number, date, link, boolean, image, array, object | false | Describes how output data values are formatted to be rendered in the Output tab UI. |
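The constraints in the ViewDisplay and ViewDisplayProperty definitions can be checked before deploying an Actor. The sketch below is a hypothetical pre-flight check, not an official validator. checkViewDisplay and its messages are invented for illustration, and it only covers the rules stated in the definitions above.

```javascript
// Format values listed in the ViewDisplayProperty definition.
const ALLOWED_FORMATS = ['text', 'number', 'date', 'link', 'boolean', 'image', 'array', 'object'];

// Hypothetical check: returns a list of human-readable problems for one view.
function checkViewDisplay(view) {
    const problems = [];
    const fields = view.transformation?.fields ?? [];
    const { component, properties = {} } = view.display ?? {};

    // Only the table component is available.
    if (component !== 'table') problems.push(`Unsupported component: ${component}`);

    for (const [key, property] of Object.entries(properties)) {
        // Keys of display.properties should match transformation.fields.
        if (!fields.includes(key)) problems.push(`"${key}" is not listed in transformation.fields`);
        // format is optional, but when present it must be one of the allowed values.
        if (property.format !== undefined && !ALLOWED_FORMATS.includes(property.format)) {
            problems.push(`"${key}" has unknown format "${property.format}"`);
        }
    }
    return problems;
}

const view = {
    title: 'Overview',
    transformation: { fields: ['title', 'price'] },
    display: {
        component: 'table',
        properties: {
            title: { label: 'Title', format: 'text' },
            price: { label: 'Price', format: 'currency' },
            extra: { label: 'Extra' },
        },
    },
};

console.log(checkViewDisplay(view));
// [ '"price" has unknown format "currency"', '"extra" is not listed in transformation.fields' ]
```

A check like this catches the most common mismatch after using flatten, where display.properties still refers to the original nested name instead of the flattened one.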