Working with storages

The Actor class provides methods to work either with the default storages of the Actor, or with any other storage, named or unnamed.

Types of storages

There are three types of storages available to Actors.

First are datasets, which are append-only tables for storing the results of your Actors. You can open a dataset through the Actor.open_dataset method, and work with it through the resulting Dataset class instance.

Next there are key-value stores, which function as a read/write storage for storing file-like objects, typically the Actor state or binary results. You can open a key-value store through the Actor.open_key_value_store method, and work with it through the resulting KeyValueStore class instance.

Finally, there are request queues. These are queues into which you can put the URLs you want to scrape, and from which the Actor can dequeue them and process them. You can open a request queue through the Actor.open_request_queue method, and work with it through the resulting RequestQueue class instance.

Each Actor run has its default dataset, default key-value store and default request queue.

Local storage emulation

To be able to develop Actors locally, the storages that the Apify platform provides are emulated on the local filesystem.

The storage contents are loaded from and saved to the storage folder in the Actor's main folder. Each storage type is stored in its own subfolder, so for example datasets are stored in the storage/datasets folder.

Each storage is then stored in its own folder, named after the storage, or called default if it's the default storage. For example, a request queue with the name my-queue would be stored in storage/request_queues/my-queue.

Each dataset item, key-value store record, or request in a request queue is then stored in its own file in the storage folder. Dataset items and request queue requests are always JSON files, while key-value store records can be any file type, depending on their content type. For example, the Actor input is typically stored in storage/key_value_stores/default/INPUT.json.
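The folder convention described above can be sketched as a small path helper. This is only an illustration of the layout: local_storage_dir and kvs_record_path are hypothetical names, not part of the Apify SDK, and the file extension is passed in by the caller rather than derived from a content type.

```python
from __future__ import annotations

from pathlib import PurePosixPath

# Root folder of the emulated storages, relative to the Actor's main folder.
STORAGE_ROOT = PurePosixPath('storage')


def local_storage_dir(storage_type: str, name: str | None = None) -> PurePosixPath:
    """Return the folder of a storage; the default storage (name=None) uses 'default'."""
    return STORAGE_ROOT / storage_type / (name or 'default')


def kvs_record_path(key: str, extension: str, name: str | None = None) -> PurePosixPath:
    """Return the file path of a key-value store record (hypothetical helper)."""
    return local_storage_dir('key_value_stores', name) / f'{key}.{extension}'


print(local_storage_dir('request_queues', 'my-queue'))  # storage/request_queues/my-queue
print(kvs_record_path('INPUT', 'json'))  # storage/key_value_stores/default/INPUT.json
```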

Local Actor run with remote storage

When developing locally, opening any storage uses local storage by default. To use remote storage instead, pass the force_cloud=True argument to Actor.open_dataset, Actor.open_request_queue or Actor.open_key_value_store. Using this argument selectively lets you work with both local and remote storages in the same run.

A typical use case is calling another Actor on the platform and then opening its default storages with the force_cloud=True argument.
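As a rough sketch of that use case, the helper below reads items from a dataset on the platform while the Actor itself runs locally. The function name read_remote_items and its parameters are illustrative, not part of the SDK. It takes the Actor class as an argument and assumes you already know the ID of the dataset you want to read, for example the default dataset of a run you started earlier.

```python
from typing import Any


async def read_remote_items(actor: Any, dataset_id: str, limit: int = 10) -> list:
    """Read up to `limit` items from a dataset stored on the Apify platform."""
    # force_cloud=True opens the storage on the platform even when the Actor runs locally.
    dataset = await actor.open_dataset(id=dataset_id, force_cloud=True)
    page = await dataset.get_data(limit=limit)
    return list(page.items)
```

Inside an async with Actor: block you could call it as items = await read_remote_items(Actor, dataset_id), where dataset_id is a placeholder for a real dataset ID.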

Local storage persistence

By default, the storage contents are persisted across multiple Actor runs. To clean up the Actor storages before running the Actor, use the --purge flag of the apify run command of the Apify CLI.

apify run --purge

Convenience methods for working with default storages

There are several methods for directly working with the default key-value store or default dataset of the Actor.

Actor.get_value('my-record') reads a record from the default key-value store of the Actor.
Actor.set_value('my-record', 'my-value') saves a new value to the record in the default key-value store.
Actor.get_input reads the Actor input from the default key-value store of the Actor.
Actor.push_data([{'result': 'Hello, world!'}, ...]) saves results to the default dataset of the Actor.
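A short sketch of how these methods might be combined is shown below. The function count_run, the record key run-count, and the greeting input field are made up for illustration. The function takes the Actor class as an argument, and it assumes that reading a record that does not exist yet yields None.

```python
from typing import Any


async def count_run(actor: Any) -> int:
    """Track how many times the Actor has run, using only the default storages."""
    # Read the Actor input; fall back to an empty dict when there is none.
    actor_input = await actor.get_input() or {}

    # Read the previous counter value (assumed to be None on the first run).
    previous = await actor.get_value('run-count')
    run_count = (previous or 0) + 1

    # Save the new counter and push one result item to the default dataset.
    await actor.set_value('run-count', run_count)
    await actor.push_data([{'run': run_count, 'greeting': actor_input.get('greeting')}])
    return run_count
```

Inside an async with Actor: block you could call it as run_count = await count_run(Actor).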

Opening named and unnamed storages

The Actor.open_dataset, Actor.open_key_value_store and Actor.open_request_queue methods can be used to open any storage for reading and writing. You can either use them without arguments to open the default storages, or you can pass a storage ID or name to open another storage.

from apify import Actor, Request


async def main() -> None:
    async with Actor:
        # Work with the default dataset of the Actor
        dataset = await Actor.open_dataset()
        await dataset.push_data({'result': 'Hello, world!'})

        # Work with the key-value store with ID 'mIJVZsRQrDQf4rUAf'
        key_value_store = await Actor.open_key_value_store(id='mIJVZsRQrDQf4rUAf')
        await key_value_store.set_value('record', 'Hello, world!')

        # Work with the request queue with the name 'my-queue'
        request_queue = await Actor.open_request_queue(name='my-queue')
        await request_queue.add_request(Request.from_url('https://apify.com'))

Deleting storages

To delete a storage, you can use the Dataset.drop, KeyValueStore.drop or RequestQueue.drop methods.

from apify import Actor


async def main() -> None:
    async with Actor:
        # Open a key-value store with the name 'my-cool-store'
        key_value_store = await Actor.open_key_value_store(name='my-cool-store')
        await key_value_store.set_value('record', 'Hello, world!')

        # Do something ...

        # Now we don't want it anymore
        await key_value_store.drop()

Working with datasets

In this section we will show you how to work with datasets.

Reading & writing items

To write data into a dataset, you can use the Dataset.push_data method.

To read data from a dataset, you can use the Dataset.get_data method.

To get an iterator of the data, you can use the Dataset.iterate_items method.

from apify import Actor


async def main() -> None:
    async with Actor:
        # Open a dataset and write some data in it
        dataset = await Actor.open_dataset(name='my-cool-dataset')
        await dataset.push_data([{'itemNo': i} for i in range(1000)])

        # Read back the first half of the data
        first_half = await dataset.get_data(limit=500)
        Actor.log.info(f'The first half of items = {first_half.items}')

        # Iterate over the second half
        second_half = [item async for item in dataset.iterate_items(offset=500)]
        Actor.log.info(f'The second half of items = {second_half}')

Exporting items

You can also export the dataset items into a key-value store, as either a CSV or a JSON record, using the Dataset.export_to method with the content_type argument set to 'csv' or 'json'.

from apify import Actor


async def main() -> None:
    async with Actor:
        # Open a dataset and write some data in it
        dataset = await Actor.open_dataset(name='my-cool-dataset')
        await dataset.push_data([{'itemNo': i} for i in range(1000)])

        # Export the data as CSV
        await dataset.export_to(
            content_type='csv',
            key='data.csv',
            to_kvs_name='my-cool-key-value-store',
        )

        # Export the data as JSON
        await dataset.export_to(
            content_type='json',
            key='data.json',
            to_kvs_name='my-cool-key-value-store',
        )

        # Print the exported records
        store = await Actor.open_key_value_store(name='my-cool-key-value-store')

        csv_data = await store.get_value('data.csv')
        Actor.log.info(f'CSV data: {csv_data}')

        json_data = await store.get_value('data.json')
        Actor.log.info(f'JSON data: {json_data}')

Working with key-value stores

In this section we will show you how to work with key-value stores.

Reading and writing records

To read records from a key-value store, you can use the KeyValueStore.get_value method.

To write records into a key-value store, you can use the KeyValueStore.set_value method. You can set the content type of a record with the content_type argument. To delete a record, set its value to None.

from apify import Actor


async def main() -> None:
    async with Actor:
        # Open a named key-value store
        kvs = await Actor.open_key_value_store(name='my-cool-key-value-store')

        # Write some data to it
        await kvs.set_value('automatic_text', 'abcd')
        await kvs.set_value('automatic_json', {'ab': 'cd'})
        await kvs.set_value('explicit_csv', 'a,b\nc,d', content_type='text/csv')

        # Get the values and log them
        automatic_text = await kvs.get_value('automatic_text')
        Actor.log.info(f'Automatic text: {automatic_text}')

        automatic_json = await kvs.get_value('automatic_json')
        Actor.log.info(f'Automatic JSON: {automatic_json}')

        explicit_csv = await kvs.get_value('explicit_csv')
        Actor.log.info(f'Explicit CSV: {explicit_csv}')

        # Delete the `automatic_text` value
        await kvs.set_value('automatic_text', None)

Iterating keys

To get an iterator of the key-value store record keys, you can use the KeyValueStore.iterate_keys method.

from apify import Actor


async def main() -> None:
    async with Actor:
        # Open a named key-value store
        kvs = await Actor.open_key_value_store(name='my-cool-key-value-store')

        # Write some data to it
        await kvs.set_value('automatic_text', 'abcd')
        await kvs.set_value('automatic_json', {'ab': 'cd'})
        await kvs.set_value('explicit_csv', 'a,b\nc,d', content_type='text/csv')

        # Print the info for each record
        Actor.log.info('Records in store:')

        async for key, info in kvs.iterate_keys():
            Actor.log.info(f'key={key}, info={info}')

Public URLs of records

To get a publicly accessible URL of a key-value store record, you can use the KeyValueStore.get_public_url method.

from apify import Actor


async def main() -> None:
    async with Actor:
        # Open a named key-value store
        store = await Actor.open_key_value_store(name='my-cool-key-value-store')

        # Get the public URL of a record
        my_record_url = await store.get_public_url('my_record')
        Actor.log.info(f'URL of "my_record": {my_record_url}')

Working with request queues

In this section we will show you how to work with request queues.

Adding requests to a queue

To add a request into the queue, you can use the RequestQueue.add_request method.

You can use the forefront boolean argument to specify whether the request should go to the beginning of the queue, or to the end.

You can use the unique_key of the request to uniquely identify a request. If you try to add more requests with the same unique key, only the first one will be added.

Check out the Request class for more information on how to create requests and what properties they have.
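To make the effect of forefront and unique_key concrete, here is a toy in-memory model of the behavior described above. It is not how RequestQueue is implemented: ToyQueue is a hypothetical class, and it assumes for simplicity that the unique key defaults to the URL itself.

```python
from __future__ import annotations

from collections import deque


class ToyQueue:
    """A toy model of request deduplication and ordering (not the real RequestQueue)."""

    def __init__(self) -> None:
        self._pending: deque[str] = deque()
        self._seen_keys: set[str] = set()

    def add_request(self, url: str, unique_key: str | None = None, forefront: bool = False) -> bool:
        """Add a URL; return False if a request with the same unique key already exists."""
        key = unique_key or url  # Assumption: the unique key defaults to the URL.
        if key in self._seen_keys:
            return False
        self._seen_keys.add(key)
        if forefront:
            self._pending.appendleft(url)
        else:
            self._pending.append(url)
        return True

    def pending(self) -> list[str]:
        """Return the pending URLs in the order they would be processed."""
        return list(self._pending)


queue = ToyQueue()
queue.add_request('http://example.com/1')
queue.add_request('http://example.com/2')
# forefront=True puts this request at the beginning of the queue.
queue.add_request('http://example.com/0', forefront=True)
# Same unique key as the first request, so this one is ignored.
queue.add_request('http://different-example.com/1', unique_key='http://example.com/1')
print(queue.pending())
# ['http://example.com/0', 'http://example.com/1', 'http://example.com/2']
```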

Reading requests

To fetch the next request from the queue for processing, you can use the RequestQueue.fetch_next_request method.

To get info about a specific request from the queue, you can use the RequestQueue.get_request method.

Handling requests

To mark a request as handled, you can use the RequestQueue.mark_request_as_handled method.

To mark a request as not handled, so that it gets retried, you can use the RequestQueue.reclaim_request method.

To check if all the requests in the queue are handled, you can use the RequestQueue.is_finished method.

Full example

import asyncio
import random

from apify import Actor, Request

FAILURE_RATE = 0.3


async def main() -> None:
    async with Actor:
        # Open the queue
        queue = await Actor.open_request_queue()

        # Add some requests to the queue
        for i in range(1, 10):
            await queue.add_request(Request.from_url(f'http://example.com/{i}'))

        # Add a request to the start of the queue, for priority processing
        await queue.add_request(Request.from_url('http://example.com/0'), forefront=True)

        # If you try to add an existing request again, it will not do anything
        add_request_info = await queue.add_request(
            Request.from_url('http://example.com/5')
        )
        Actor.log.info(f'Add request info: {add_request_info}')

        processed_request = await queue.get_request(add_request_info.id)
        Actor.log.info(f'Processed request: {processed_request}')

        # Finally, process the queue until all requests are handled
        while not await queue.is_finished():
            # Fetch the next unhandled request in the queue
            request = await queue.fetch_next_request()
            # This can happen due to the eventual consistency of the underlying request
            # queue storage. The simplest solution is to sleep a bit and try again.
            if request is None:
                await asyncio.sleep(1)
                continue

            Actor.log.info(f'Processing request {request.unique_key}...')
            Actor.log.info(f'Scraping URL {request.url}...')

            # Do some fake work, which fails 30% of the time
            await asyncio.sleep(1)
            if random.random() > FAILURE_RATE:
                # If processing the request was successful, mark it as handled
                Actor.log.info('Request successful.')
                await queue.mark_request_as_handled(request)
            else:
                # If processing the request was unsuccessful, reclaim it so it can be
                # processed again.
                Actor.log.warning('Request failed, will retry!')
                await queue.reclaim_request(request)
