Version: 3.0

Dataset<Data> (external)

The Dataset class represents a store for structured data where each object stored has the same attributes, such as online store products or real estate offers. You can imagine it as a table, where each object is a row and its attributes are columns. Dataset is an append-only storage: you can add new records to it, but you cannot modify or remove existing records. Typically it is used to store crawling results.

Do not instantiate this class directly; use the Dataset.open function instead.

Dataset stores its data either on local disk or in the Apify cloud, depending on whether the APIFY_LOCAL_STORAGE_DIR or APIFY_TOKEN environment variables are set.

If the APIFY_LOCAL_STORAGE_DIR environment variable is set, the data is stored in the local directory in the following files:

{APIFY_LOCAL_STORAGE_DIR}/datasets/{DATASET_ID}/{INDEX}.json

Note that {DATASET_ID} is the name or ID of the dataset. The default dataset has the ID default, unless you override it by setting the APIFY_DEFAULT_DATASET_ID environment variable. Each dataset item is stored as a separate JSON file, where {INDEX} is a zero-based index of the item in the dataset.

If the APIFY_TOKEN environment variable is set but APIFY_LOCAL_STORAGE_DIR is not, the data is stored in the Apify Dataset cloud storage. You can also force the use of cloud storage by passing the forceCloud option to the Dataset.open function, even if the APIFY_LOCAL_STORAGE_DIR variable is set.
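The selection rules above can be summarized in a few lines of code. The following is only a sketch: resolveStorageMode is a hypothetical helper written for illustration, not a function of the library, and the 'unspecified' result is a placeholder because the text above does not say what happens when neither variable is set.

```javascript
// Hypothetical helper: models the storage selection rules described above.
// It is not part of the library and does not reproduce its internal logic.
function resolveStorageMode(env, { forceCloud = false } = {}) {
    // forceCloud wins even when a local storage directory is configured.
    if (forceCloud) return 'cloud';
    // A configured local directory takes priority over the token.
    if (env.APIFY_LOCAL_STORAGE_DIR) return 'local';
    if (env.APIFY_TOKEN) return 'cloud';
    // The documentation above does not cover this case.
    return 'unspecified';
}

console.log(resolveStorageMode({ APIFY_LOCAL_STORAGE_DIR: './apify_storage' })); // local
console.log(resolveStorageMode({ APIFY_TOKEN: 'my-token' })); // cloud
console.log(resolveStorageMode(
    { APIFY_LOCAL_STORAGE_DIR: './apify_storage', APIFY_TOKEN: 'my-token' },
    { forceCloud: true },
)); // cloud
```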

Example usage:

// Assumes the `crawlee` package is installed and exports Dataset
import { Dataset } from 'crawlee';

// Write a single row to the default dataset
await Dataset.pushData({ col1: 123, col2: 'val2' });

// Open a named dataset
const dataset = await Dataset.open('some-name');

// Write a single row
await dataset.pushData({ foo: 'bar' });

// Write multiple rows
await dataset.pushData([
    { foo: 'bar2', col2: 'val2' },
    { col3: 123 },
]);

Index

Properties

client (external)

client: DatasetClient<Data>

config (external, readonly)

config: Configuration

id (external)

id: string

log (external)

log: Log

name (external, optional)

name?: string

Methods

drop (external)

  • drop(): Promise<void>
  • Removes the dataset either from the Apify cloud storage or from the local directory, depending on the mode of operation.


    Returns Promise<void>

forEach (external)

  • forEach(iteratee, options, index): Promise<void>
  • Iterates over dataset items, yielding each in turn to an iteratee function. The iteratee is called with two arguments: (item, index).

    If the iteratee function returns a Promise then it is awaited before the next call. If it throws an error, the iteration is aborted and the forEach function throws the error.

    Example usage:

    const dataset = await Dataset.open('my-results');
    await dataset.forEach(async (item, index) => {
        console.log(`Item at ${index}: ${JSON.stringify(item)}`);
    });

    Parameters

    • iteratee: DatasetConsumer<Data> (external)

      A function that is called for every item in the dataset.

    • options: DatasetIteratorOptions (external, optional)

      All forEach() parameters.

    • index: number (external, optional)

      Specifies the initial index number passed to the iteratee function.

    Returns Promise<void>
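The forEach() contract described above (sequential awaiting, abort on the first error, configurable starting index) can be modeled with a plain array. This is a minimal in-memory sketch for illustration only: forEachSequential and demo are hypothetical names, and the code is not the library's implementation.

```javascript
// Illustrative in-memory model of the forEach() contract; not the library's code.
async function forEachSequential(items, iteratee, index = 0) {
    for (const item of items) {
        // Awaiting here means the next call starts only after this one settles.
        // A thrown error or rejected promise propagates and stops the loop.
        await iteratee(item, index);
        index += 1;
    }
}

async function demo() {
    const seen = [];
    try {
        // Start numbering at 10 to show the effect of the index argument.
        await forEachSequential(['a', 'b', 'c'], async (item, i) => {
            if (item === 'c') throw new Error('stop at c');
            seen.push(`${i}:${item}`);
        }, 10);
    } catch (err) {
        seen.push(`aborted: ${err.message}`);
    }
    return seen;
}

demo().then((seen) => console.log(seen));
// logs [ '10:a', '11:b', 'aborted: stop at c' ]
```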

getData (external)

  • getData(options): Promise<PaginatedList<Data>>
  • Returns a PaginatedList object holding the items in the dataset based on the provided parameters.


    Parameters

    • options: DatasetDataOptions (external, optional)

    Returns Promise<PaginatedList<Data>>
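Because getData() returns a paginated list, a caller that wants every item typically requests pages in a loop. The sketch below shows one way to do that. It assumes that the options accept offset and limit and that the result exposes items and total; those field names are not spelled out in this section, so treat them as assumptions. collectAllItems and makeFakeDataset are hypothetical helpers, and the fake dataset exists only so the example runs without any storage.

```javascript
// Sketch: collect every item by requesting fixed-size pages.
// `offset`, `limit`, `items` and `total` are assumed field names.
async function collectAllItems(dataset, pageSize = 100) {
    const all = [];
    let offset = 0;
    while (true) {
        const page = await dataset.getData({ offset, limit: pageSize });
        all.push(...page.items);
        offset += page.items.length;
        // Stop on an empty page or once the reported total is reached.
        if (page.items.length === 0 || offset >= page.total) break;
    }
    return all;
}

// Hypothetical stand-in for a dataset, used only to exercise the helper.
function makeFakeDataset(rows) {
    return {
        async getData({ offset = 0, limit = rows.length } = {}) {
            return { items: rows.slice(offset, offset + limit), total: rows.length };
        },
    };
}

collectAllItems(makeFakeDataset([1, 2, 3, 4, 5]), 2)
    .then((items) => console.log(items.length)); // logs 5
```

With a real dataset, the same helper would be called as collectAllItems(dataset), provided the assumed field names match.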

getInfo (external)

  • getInfo(): Promise<undefined | DatasetInfo>
  • Returns an object containing general information about the dataset.

    The function returns the same object as the Apify API Client's getDataset function, which in turn calls the Get dataset API endpoint.

    Example:

    {
        id: "WkzbQMuFYuamGv3YF",
        name: "my-dataset",
        userId: "wRsJZtadYvn4mBZmm",
        createdAt: new Date("2015-12-12T07:34:14.202Z"),
        modifiedAt: new Date("2015-12-13T08:36:13.202Z"),
        accessedAt: new Date("2015-12-14T08:36:13.202Z"),
        itemCount: 14,
    }

    Returns Promise<undefined | DatasetInfo>

map (external)

  • map<R>(iteratee, options): Promise<R[]>
  • Produces a new array of values by mapping each item in the dataset through a transformation function iteratee(). The iteratee() is called with two arguments: (element, index).

    If iteratee() returns a Promise, it is awaited before the next call.


    Type parameters

    • R

    Parameters

    • iteratee: DatasetMapper<Data, R> (external)
    • options: DatasetIteratorOptions (external, optional)

      All map() parameters.

    Returns Promise<R[]>

pushData (external)

  • pushData(data): Promise<void>
  • Stores an object or an array of objects to the dataset. The function returns a promise that resolves when the operation finishes. It has no result, but throws on invalid arguments or other errors.

    IMPORTANT: Make sure to use the await keyword when calling pushData(), otherwise the crawler process might finish before the data is stored!

    The size of the data is limited by the receiving API and therefore pushData() will only allow objects whose JSON representation is smaller than 9MB. When an array is passed, none of the included objects may be larger than 9MB, but the array itself may be of any size.

    The function internally chunks the array into separate items and pushes them sequentially. The chunking process is stable (keeps order of data), but it does not provide a transaction safety mechanism. Therefore, in the event of an uploading error (after several automatic retries), the function's Promise will reject and the dataset will be left in a state where some of the items have already been saved to the dataset while other items from the source array were not. To overcome this limitation, the developer may, for example, read the last item saved in the dataset and re-attempt the save of the data from this item onwards to prevent duplicates.


    Parameters

    • data: Data | Data[] (external)

      Object or array of objects containing data to be stored in the dataset. The objects must be serializable to JSON and the JSON representation of each object must be smaller than 9MB.

    Returns Promise<void>
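Two points from the pushData() description lend themselves to small helpers: checking item sizes before pushing, and working out which items still need to be saved after a partial failure. The sketch below is one possible approach under stated assumptions. Both helpers are hypothetical and not part of the library. "9MB" is interpreted here as 9 * 1024 * 1024 bytes of UTF-8 JSON, which may not match the exact limit enforced by the API, and the resume helper assumes every item carries a unique key.

```javascript
// Assumption: "9MB" is taken as 9 * 1024 * 1024 bytes of UTF-8 JSON.
const MAX_ITEM_BYTES = 9 * 1024 * 1024;

// Hypothetical helper: returns the indexes of items too large to push.
function findOversizedItems(items, maxBytes = MAX_ITEM_BYTES) {
    const oversized = [];
    items.forEach((item, i) => {
        const bytes = Buffer.byteLength(JSON.stringify(item), 'utf8');
        // The JSON representation must be smaller than the limit.
        if (bytes >= maxBytes) oversized.push(i);
    });
    return oversized;
}

// Hypothetical helper: given the last item known to be saved, return the
// part of the source array that still needs to be pushed.
function itemsAfterLastSaved(source, lastSaved, keyOf) {
    if (lastSaved == null) return source.slice(); // nothing saved yet
    const lastIndex = source.findIndex((item) => keyOf(item) === keyOf(lastSaved));
    // If the last saved item is not from this array, retry everything.
    return lastIndex === -1 ? source.slice() : source.slice(lastIndex + 1);
}

const source = [{ id: 1 }, { id: 2 }, { id: 3 }];
console.log(itemsAfterLastSaved(source, { id: 2 }, (item) => item.id)); // logs [ { id: 3 } ]
```

After a rejected pushData() call, the remaining items could then be passed to pushData() again; how the last saved item is read back from the dataset is left to the caller.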

reduce (external)

  • reduce<T>(iteratee, memo, options): Promise<T>
  • Reduces a list of values down to a single value.

    Memo is the initial state of the reduction, and each successive step of it should be returned by iteratee(). The iteratee() is passed three arguments: the memo, then the value and index of the iteration.

    If no memo is passed to the initial invocation of reduce, the iteratee() is not invoked on the first element of the list. The first element is instead passed as the memo in the invocation of the iteratee() on the next element in the list.

    If iteratee() returns a Promise, it is awaited before the next call.


    Type parameters

    • T

    Parameters

    • iteratee: DatasetReducer<T, Data> (external)
    • memo: T (external)

      Initial state of the reduction.

    • options: DatasetIteratorOptions (external, optional)

      All reduce() parameters.

    Returns Promise<T>
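The memo rules for reduce() are easier to follow with a concrete model. The following in-memory sketch mirrors the description above using a plain array. reduceSequential is a hypothetical name, the code is not the library's implementation, and treating an undefined memo as "no memo passed" is an assumption made for this example.

```javascript
// Illustrative in-memory model of the reduce() contract; not the library's code.
async function reduceSequential(items, iteratee, memo) {
    let index = 0;
    for (const item of items) {
        if (index === 0 && memo === undefined) {
            // No initial memo: the first item becomes the memo and the
            // iteratee is not called for it.
            memo = item;
        } else {
            // Each step is awaited before the next call.
            memo = await iteratee(memo, item, index);
        }
        index += 1;
    }
    return memo;
}

// With an initial memo, the iteratee runs for every item.
reduceSequential([{ price: 2 }, { price: 3 }], (sum, item) => sum + item.price, 0)
    .then((total) => console.log(total)); // logs 5

// Without a memo, the first item seeds the reduction.
reduceSequential([1, 2, 3], async (sum, value) => sum + value)
    .then((total) => console.log(total)); // logs 6
```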

open (static, external)

  • open<Data>(datasetIdOrName, options): Promise<Dataset<Data>>
  • Opens a dataset and returns a promise resolving to an instance of the Dataset class.

    Datasets are used to store structured data where each object stored has the same attributes, such as online store products or real estate offers. The actual data is stored either on the local filesystem or in the cloud.

    For more details and code examples, see the Dataset class.


    Type parameters

    • Data: Dictionary<any> = Dictionary<any>

    Parameters

    • datasetIdOrName: null | string (external, optional)

      ID or name of the dataset to be opened. If null or undefined, the function returns the default dataset associated with the crawler run.

    • options: StorageManagerOptions (external, optional)

      Storage manager options.

    Returns Promise<Dataset<Data>>